In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #data visualization
import seaborn as sns #for much better data visualization

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Ask Phase** #
**Problem Statement**
- To understand how smart devices are used in today's smart device market

However, the provided problem statement is too generic, so we shall make it more specific. Since Bellabeat manufactures health-focused smart products, the activity patterns of smart device users would be of interest to them. Therefore, the newly crafted problem statement is as follows:
- To understand the activity patterns of smart device users in the market today.

**Below are questions that shall guide the analysis for Bellabeat:**
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat customers?
3. How could these trends help influence Bellabeat marketing strategy?

# **Prepare Phase** # 
The data used for the analysis is provided by FitBit and made available to the public through Mobius. Since the data was not collected directly by Bellabeat, it shall be classified as External Data.

***Note:** A sample size of 30 is too small to properly represent the entire user base of FitBit. Further analysis is to be performed when access to additional smart device data is available.*

**Let's proceed to import the dailyActivity Dataset to better understand the data that is being studied in the case study**

In [None]:
#Importing the dailyActivity Dataset
dailyActivity_df = pd.read_csv("../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
dailyActivity_df

As seen from the dataframe above, the data is stored in a long format.

**Let's also get some information on our dataset**

In [None]:
dailyActivity_df.info()

Based on the information generated, there are a total of 940 data points, all of which are non-null values. However, notice that ActivityDate is stored with the object datatype. Let's convert it to a Pandas datetime object.

In [None]:
dailyActivity_df['ActivityDate']=pd.to_datetime(dailyActivity_df['ActivityDate'])
dailyActivity_df.info()
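
As a minimal sketch of the same conversion with an explicit date format: the toy values below and the month/day/year layout are assumptions for illustration, not taken from the dataset itself. Stating the format avoids relying on automatic inference, which can misread dates such as 4/12 as 4 December.

```python
import pandas as pd

# Toy stand-in for the ActivityDate column (hypothetical values).
# The month/day/year layout is an assumption about how the dates are written.
sample = pd.DataFrame({"ActivityDate": ["4/12/2016", "4/13/2016", "5/1/2016"]})

# An explicit format makes the parsing unambiguous.
sample["ActivityDate"] = pd.to_datetime(sample["ActivityDate"], format="%m/%d/%Y")

# Once the column is a datetime, the .dt accessor exposes calendar fields.
months = sample["ActivityDate"].dt.month.tolist()
print(sample.dtypes)
print(months)
```

If the real column uses a different layout, only the `format` string would need to change.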

**Before proceeding, let's evaluate the data based on the ROCCC criteria introduced during the Google Data Analytics Course:**
* *Reliability:* Not reliable. A sample of 30 is too small to represent the overall population of FitBit smart device users, and we have no demographic information about the participants (e.g. were they mainly women or a mix of men & women? What were their ages?)
* *Original:* The datasets are deemed original since we know which organization prepared the data (FitBit)
* *Comprehensive:* Based on the initial screening of the dataset, it covers the users' activity quite comprehensively.
* *Current:* The dataset is not considered current since it was collected in 2016. Advances in technology over the span of 5 years could have changed how FitBit users, and by extension smart device users, use their devices.
* *Cited:* No information was provided on whether other organizations have cited the data prepared by FitBit.

As no gender is specified in the dataset, we are unable to filter for the data that is relevant to women (the audience of interest for Bellabeat). Since there are no null values present in the dataset, we are ready to move on to the Process Phase.
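
Since the reliability assessment rests on the sample size, it may be worth confirming the number of participants from the data itself. The sketch below uses toy rows and assumes a participant identifier column named `Id`, which is not shown in the text above.

```python
import pandas as pd

# Toy rows for illustration; "Id" is an assumed participant identifier column.
toy = pd.DataFrame({
    "Id": [1001, 1001, 1002, 1003, 1003, 1003],
    "TotalSteps": [5000, 7000, 3000, 12000, 9000, 10000],
})

# Number of distinct participants in the data.
n_participants = toy["Id"].nunique()

# Number of daily records contributed by each participant.
days_per_participant = toy.groupby("Id").size()

print(n_participants)
print(days_per_participant)
```

Running the same two calls on `dailyActivity_df` would show how many users the 940 rows actually come from and whether some users contribute far fewer days than others.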

# **Process Phase**

For the Process Phase, the tool selected for analysis is Python. The original plan was to use R, but switching to R flagged a warning that all the content written thus far would be erased, so the decision was made to proceed with Python. The concepts from the Google Data Analytics Course shall still be applied.

A programming approach was selected due to the size of the dataset and because, in my opinion, writing out code helps to make the analysis logic / methodology visible.

To verify that the dataset is clean for analysis, a final check for missing values is performed in the code chunk below.



In [None]:
#Check for missing values with isna()
print(dailyActivity_df.isna().nunique())
dailyActivity_df.isna()

Based on the above check, each column is found to have 1 unique value, and from the dataframe it is clear that the unique value is False. Since isna() returned False for each data point, this confirms that there are no missing values within the dataset.
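
A more direct way to express the same check is to count the missing values per column. The sketch below runs on a small toy frame (hypothetical values, with one gap added on purpose) and also counts fully duplicated rows, which is an extra check not performed above.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (hypothetical data).
toy = pd.DataFrame({
    "TotalSteps": [5000, np.nan, 3000],
    "Calories": [1800, 2100, 1950],
})

# Count of missing values in each column; all zeros means the column is complete.
missing_per_column = toy.isna().sum()

# Single True/False answer for the whole frame.
has_missing = bool(toy.isna().any().any())

# Number of rows that are exact copies of an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(has_missing, n_duplicate_rows)
```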

With that check done, time to move on to the Analyze Phase.

# **Analyze Phase** 
In this phase, we shall perform some exploratory data analysis by generating data visualizations to better understand and draw insights from our data. This section will also effectively cover the "Share" Phase of the Google Data Analytics Framework, since this entire notebook can be viewed as the document that will be shared with others.

As the goal of this analysis is to understand & spot any trends in users' activities, let's first get a statistical description of the data using the describe() method provided by Pandas.

In [None]:
dailyActivity_df.describe()

Based on the description provided, Total Distance & Tracker Distance appear to be duplicate measurements, since their summary statistics are almost the same.
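
Similar summary statistics suggest, but do not prove, that two columns hold the same values row by row. A minimal sketch of a direct comparison is shown below on toy numbers (hypothetical values); the same comparison could be run on `dailyActivity_df`.

```python
import numpy as np
import pandas as pd

# Toy values for illustration; the last row differs on purpose.
toy = pd.DataFrame({
    "TotalDistance": [8.50, 6.97, 9.88, 5.10],
    "TrackerDistance": [8.50, 6.97, 9.88, 4.90],
})

# Row-wise comparison that tolerates floating point rounding.
same = np.isclose(toy["TotalDistance"], toy["TrackerDistance"])

# Share of rows where the two measurements agree.
share_identical = same.mean()

# Rows where they disagree, for manual inspection.
differing_rows = toy[~same]

print(share_identical)
print(differing_rows)
```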

Looking at the timing data recorded (VeryActiveMinutes, FairlyActiveMinutes etc), it is hard to make sense of all these values. To better understand how active the users are, it's best to compare the minutes the users spend being active against the total time the FitBit is worn. We will have to perform some calculations on the data frame in order to achieve that goal.

In [None]:
#Creating columns within the Dataframe to record Active Time, Total Time & Proportion Of Time Being Active (%)
activeColumns = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes']
dailyActivity_df['ActiveTime'] = dailyActivity_df[activeColumns].sum(axis=1)
#Total Time is the Active Time plus the Sedentary Minutes
dailyActivity_df['TotalTime'] = dailyActivity_df['ActiveTime'] + dailyActivity_df['SedentaryMinutes']
dailyActivity_df['ProportionOfTimeActive%'] = dailyActivity_df['ActiveTime'] / dailyActivity_df['TotalTime'] * 100
dailyActivity_df.head()

Let's have a look at the statistical data of the Proportion Of Time Active % and visualize its distribution.

In [None]:
#Statistical Data for ActiveTime, TotalTime & ProportionOfTimeActive%
dailyActivity_df[['ActiveTime', 'TotalTime', 'ProportionOfTimeActive%']].describe()

In [None]:
#Visualizing the Statistical Distribution using Seaborn
sns.set_theme()
fig_dims = (10, 8)
fig, ax = plt.subplots(figsize=fig_dims)
sns.kdeplot(data=dailyActivity_df['ProportionOfTimeActive%'], ax=ax)

**Conclusion:** As seen from the distribution plot, users typically spend approximately 20% of the time being active while wearing their FitBit. Based on the statistical data, the average total time for which users wear their FitBit is 1219 mins per day and the average time a user spends being active while wearing it is 228 mins.
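
Note that there are two ways to arrive at an "average proportion", and they need not agree. Dividing the two averages quoted above gives 228 / 1219 ≈ 18.7%, whereas averaging the daily percentages weights every day equally regardless of how long the device was worn. The sketch below illustrates the difference on toy numbers (hypothetical values, not from the dataset).

```python
import pandas as pd

# Toy daily records for illustration (hypothetical values).
toy = pd.DataFrame({
    "ActiveTime": [100, 300, 50],
    "TotalTime": [1000, 1200, 100],
})
toy["ProportionOfTimeActive%"] = toy["ActiveTime"] / toy["TotalTime"] * 100

# Mean of the daily percentages: every day counts equally.
mean_of_daily_proportions = toy["ProportionOfTimeActive%"].mean()

# Pooled percentage: total active minutes over total recorded minutes,
# so days with longer wear time carry more weight.
pooled_proportion = toy["ActiveTime"].sum() / toy["TotalTime"].sum() * 100

print(round(mean_of_daily_proportions, 2))
print(round(pooled_proportion, 2))
```

Which figure to report depends on the question: the first describes a typical day, the second describes the overall share of recorded time.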

**Relationship Between Active Time, Total Distance & Calories**

It is a common belief that the more active a person is, the more calories they burn. Additionally, we would like to see how Total Distance relates to Active Time and Calories.

**The hypothesis is that:**
* The greater the Total Distance recorded, the greater the Active Time of the User
* The greater the Active Time, the more Calories a User will burn
* Therefore, Total Distance, Active Time and Calories are hypothesized to have a positive correlation with each other

Let us see if that's what the data shows us.

In [None]:
#Extract the columns of interest into a separate dataframe
selectedColumn = ['TotalDistance', 'ActiveTime', 'Calories']
fitness_df = dailyActivity_df[selectedColumn].copy()
fitness_df

In [None]:
#Plotting the Total Distance against Active Time
fig_dims = (10, 8)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x=fitness_df['ActiveTime'], y=fitness_df['TotalDistance'], hue=fitness_df['Calories'], size=fitness_df['Calories'], sizes=(10,150)).set_title('Total Distance against Active Time')

In [None]:
#Plotting the Calories against Active Time
fig_dims = (10, 8)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x=fitness_df['ActiveTime'], y=fitness_df['Calories'], hue=fitness_df['TotalDistance'], size=fitness_df['TotalDistance'], sizes=(10,150)).set_title('Calories against Active Time')

In [None]:
#Generating a Correlation Matrix Plot
heatmap = sns.heatmap(fitness_df.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)

**Conclusion:**

Based on the 2 plots above, it can be seen that there is a positive correlation between Total Distance, Active Time & Calories. This is further validated by the Correlation Plot generated for the 3 variables.
However, the values of the Correlation Plot suggest that the correlation is relatively strong only for Total Distance-Active Time & Total Distance-Calories. The positive correlation between Active Time & Calories is weak based on this dataset, so more research has to be done.
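
One possible follow-up is to compare the default Pearson correlation with a rank-based (Spearman) correlation, which is less sensitive to outliers and to relationships that are monotonic but not linear. The sketch below uses toy numbers (hypothetical values) in place of `fitness_df`.

```python
import pandas as pd

# Toy stand-in for fitness_df (hypothetical values).
toy = pd.DataFrame({
    "TotalDistance": [1.0, 2.5, 4.0, 6.0, 8.0],
    "ActiveTime": [60, 120, 180, 260, 300],
    "Calories": [1500, 1800, 2100, 2600, 2400],
})

# Pearson measures linear association; it is the default of DataFrame.corr().
pearson = toy.corr(method="pearson")

# Spearman measures monotonic association using ranks.
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

A large gap between the two matrices for a given pair would hint that a few extreme days or a curved relationship is driving the Pearson value.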