## Background 
The following case study is part of the Google Data Analytics course.
We are required to solve a fictional case study for the company Bellabeat which is a manufacturer of health-focused products for women. The Möbius dataset was recommended in order to gain market insights about user behavior in the area of wearable fitness trackers. 

I have decided to use Python since I have been learning it for about half a year and want to grow my skills further. I am looking forward to any helpful feedback, especially shortcuts for some of my lengthy lines of code.

## Business Task

* Gain insights about how consumers are using their smart devices in order to derive recommendations for Bellabeat's own marketing strategy in terms of new growth opportunities.
* In particular, the following analysis should help drive the growth of Bellabeat's wellness watch, which tracks user activity, sleep and stress, by revealing insights from a third-party dataset.

The following **key stakeholders** are involved in the project:
* Urška Sršen, co-founder and Chief Creative Officer (sponsor)
* Sando Mur, co-founder and mathematician
* other members of the executive team
* the marketing analytics team

## Data source and data integrity

To answer the business task, the FitBit Fitness Tracker Data dataset will be used, which has been made available by Möbius on Kaggle. The dataset has been released under a public domain license (CC0), which even allows commercial use.

The Möbius dataset links to the original dataset at Zenodo:
URL: https://zenodo.org/record/53894#.YJKvDC-22dp
The data was generated by a survey via Amazon Mechanical Turk between March 12 and May 12, 2016. However, the Möbius dataset only contains the second half of the original dataset, spanning April 12 to May 12.
Amazon Mechanical Turk is a platform that allows businesses to connect to workers for on-demand tasks. Therefore it can be assumed that the sponsor of this survey asked FitBit users to provide their data and later aggregated this data. 
According to Moss and Litman (URL: https://www.cloudresearch.com/resources/blog/who-uses-amazon-mturk-2020-demographics/ ) the population of Mechanical Turk shows the following **biases**:
* Age: significantly younger than the US population
* Gender: imbalance towards more female (57%)
* Ethnicity: Black people are underrepresented and Asian people are overrepresented

The dataset doesn't contain any further information on how the sample was defined. Could anybody take part in the survey, or were certain selection criteria applied? Participants have probably used different FitBit products. If only male users had joined, findings might be useless for Bellabeat, which only sells products for female customers. Which instructions did the participants receive? Did the temperature/weather during the survey have an impact on the users' activity? From which geographical areas were participants recruited? How well did the participants comply with the instructions of the principal? 

For the following reasons, we should interpret the data with great **caution**:
* Lack of transparency in terms of data collection
* Outdated data (the survey was in 2016) might be the wrong choice for developing a new product in 2021
* Sample size: only 33 users and a short observation period of one month (Möbius dataset) or two months (original dataset)
 
The **data integrity** is clearly not sufficient to provide reliable insights to Bellabeat. The following analysis therefore can only offer first hints and directions which should be verified through analysis of a larger and more reliable dataset.

## Description of Möbius dataset

The dataset contains personal tracker data of 30 Fitbit users (the later analysis will show that the dataset actually contains 33 users). Due to anonymized IDs the data cannot be associated with the real username anymore and therefore protects the privacy of the users.

The dataset consists of 18 csv files organized in long data format, which means that each row is one observation per user. Most users have data in multiple rows. 

Möbius offers a preprocessed dataset. The previous manipulations are, however, not transparent.

## Research questions

In order to have an initial direction for the analysis, I would like to verify the following hypotheses:
* People exercise more often at the weekend.
* People mostly exercise after work.
* People who sleep less than 6 hours tend to exercise less than the average.
* Exercise correlates with a reduced time to fall asleep.
* People are lazy in terms of manually adjusting or inputting information.
* People who perform more high-intensity workouts (MET>6) burn more calories on average.
* Many people don't wear the device at night (due to charging, discomfort, etc.)

More insights should be discovered by detecting outliers, analyzing relationships among different features, etc.

## Selected Data

I have decided to focus on the following files during my analysis: 
* daily_activity which already combines data from various other files
* hourly_calories
* sleep_day 
* weight_log_info

Let's load the libraries and csv files to have a first look at it.

## Data cleaning and analysis

In [None]:
#load libraries
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Let's load a few relevant files which might be related to Bellabeat's wellness watch and have a first look at the daily_activity dataframe.

In [None]:
daily_activity = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
weight_info = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')
sleep_day = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
hourly_calories = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyCalories_merged.csv')

### Daily Activity

Let's start the analysis with the daily_activity csv file.

In [None]:
daily_activity.head()

### Units

Unfortunately we don't know whether the distance is measured in km or miles, since this depends on the device settings. If all distances were recorded in the same unit, this wouldn't be a problem, because the ratio between two measurements is independent of the unit used. 
We could try to check whether similar step counts correspond to similar calorie values. However, since calories burned also depend on sex, age, weight and other factors which we don't know, such a check is not possible. 

Calories seem to refer to total daily calories burned incl. those at rest. 
The distinction between sedentary, lightly, fairly and very active will be explained in the following. 

### Contextual information - MET

Before heading over to the actual analysis, I would like to provide some contextual information about MET, which will affect the interpretation of different activity intensities later:
**MET** stands for **metabolic equivalents**. 1 MET is defined as the energy consumed by your body while sitting at rest. The higher the MET, the more intense the exercise.
According to FitBit, the activity data has been split into four categories: sedentary (<1.5 METs), light (1.5-3 METs), fairly active (3-6 METs) and very active (>6 METs). Walking roughly corresponds to 3 METs (URL: https://community.fitbit.com/t5/Web-API-Development/Daily-Activity-Summary-Data-Definition-Questions/td-p/3087077, https://help.fitbit.com/articles/en_US/Help_article/1379.htm)
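
The four intensity bands can be written down as a small helper. This is only a sketch: the function name `met_category` is made up for illustration, and how the exact boundary values (1.5, 3 and 6 METs) are assigned is my own assumption, since the sources above don't specify it.

```python
def met_category(met: float) -> str:
    """Map a MET value to one of the four activity categories.

    Boundary handling (e.g. exactly 3 METs counts as fairly active)
    is an assumption made for this sketch.
    """
    if met < 1.5:
        return "sedentary"
    elif met < 3:
        return "lightly active"
    elif met <= 6:
        return "fairly active"
    return "very active"


# A few example values: rest, slow walking, brisk walking, running
for value in (1.0, 2.5, 3.0, 7.2):
    print(value, met_category(value))
```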


### Daily activity dataframe

In [None]:
daily_activity.info()

We have in total 15 different columns and 940 entries. The columns contain object data types, integers or floating point numbers. Let's convert the string (object) ActivityDate into a Pandas date time object and add a separate column for the day of the week. After that let's check how many unique users we have.

In [None]:
#remove the string column 'ActivityDate' and convert it into a pandas datetime Series
activity_date = pd.to_datetime(daily_activity.pop('ActivityDate'))
#insert the converted column at position 1
daily_activity.insert(1, 'ActivityDate_new', activity_date)
#insert a column for the day of the week at position 2
#this assigns a numerical value to each day of the week, starting with a 0 for Monday
daily_activity.insert(2, 'day_of_week', activity_date.dt.dayofweek)
#let's also assign a string to each day of the week and insert it at position 3
daily_activity.insert(3, 'day_of_week_str', activity_date.dt.strftime('%A'))
daily_activity

In [None]:
#check number of unique users
len(daily_activity['Id'].unique())

In total 33 unique users have participated in the survey from April 12 till May 12, 2016. That should add up to 33x31 = 1023 maximum possible values on a daily basis. However only 940 entries were provided. It's unclear if due to preprocessing certain observations were already deleted.
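
To see which users account for the missing rows, the recorded days per user can be compared with the length of the observation window. The sketch below uses a made-up miniature dataframe (`toy`, with three hypothetical users); in the notebook, `daily_activity` would take its place.

```python
import pandas as pd

# Length of the observation window (April 12 to May 12, 2016)
EXPECTED_DAYS = 31

# Made-up miniature dataset: three hypothetical users with 31, 28 and 4 rows
toy = pd.DataFrame({"Id": [1] * 31 + [2] * 28 + [3] * 4})

# Number of recorded days per user and the shortfall against the window
recorded = toy.groupby("Id").size()
missing = EXPECTED_DAYS - recorded

# Show only users with at least one missing day
print(missing[missing > 0])
total_missing = int(missing.sum())
print("Total missing observations:", total_missing)
```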

We can also tell from the dataframe.info() function that the daily_activity dataframe doesn't contain any null values (missing values). However, let's run some checks to see if specific rows contain 0 values in the steps column. This would indicate that the wearable was not worn on a specific day (charging, forgotten, intentionally not worn, technical issue, etc.).  

In [None]:
daily_activity[daily_activity['TotalSteps']==0]

* Maximum possible daily observations: 1023 (33 unique users x 31 observation days)
* Actually provided daily observations: 940 (83 observations missing for unknown reasons: preprocessing, Mechanical Turk data collection issues, etc.)
* Number of observations where TotalSteps=0: 77 (8.2%)
* If we exclude any data collection issues, this would mean that the device wasn't worn 15.6% of the time (160/1023). But this is all speculation, which is inherent in working with third-party data. 
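
The percentages in the list above follow from simple arithmetic on the counts reported in this section. A short sketch to make the calculation reproducible (variable names are chosen for illustration):

```python
# Counts taken from the text above
users = 33
days = 31
provided = 940
zero_step_days = 77

max_obs = users * days               # maximum possible daily observations
missing = max_obs - provided         # rows that were never provided
no_data = missing + zero_step_days   # missing rows plus rows with zero steps

share_zero = zero_step_days / provided
share_no_data = no_data / max_obs
print(f"{share_zero:.1%} of provided rows have zero steps")
print(f"{share_no_data:.1%} of possible observations have no usable data")
```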

Let's check how many users were concerned:

In [None]:
daily_no_activity = daily_activity[['Id']][daily_activity['TotalSteps']==0]
daily_no_activity['Id']
daily_no_activity['Id'].nunique()

Let's have a closer look at user 1844505072:

In [None]:
daily_activity[daily_activity['Id']==1844505072]

User 1844505072 for example didn't record any data from April 24-26 (3 days), May 2nd (1 day) and May 7-12 (6 days). This user also shows relatively low values for certain days. Unless the user has been hospitalized, sits in a wheelchair or the device has any issues, the TotalSteps cannot equal 0. Let's filter once more for a low value of TotalSteps but this time increase the threshold to 100 steps which roughly equals 76 meters. Even people who relax at home and only walk to the toilet or fridge a few times should have more than 100 steps daily if they wear a fully functional fitness tracking device. 

In [None]:
daily_activity[daily_activity['TotalSteps']<100]

For the following analysis, all rows where TotalSteps < 100 should be dropped since they would distort the daily averages. My assumption is that either the watch wasn't worn properly or there were other data collection problems.

In [None]:
daily_activity.drop(daily_activity[daily_activity['TotalSteps']<100].index, inplace=True)

Let's have a look at some basic statistical details with the df.describe() function. Especially the min max values can be relevant to identify outliers.

In [None]:
daily_activity.describe()

* Another interesting fact is that the average LoggedActivitiesDistance is very low compared to TotalDistance (0.117822 vs 5.979513, in km or miles as discussed above). Apparently users rarely log their activity manually. This could indicate that users prefer automatic logging of their activity.
* Furthermore, TotalDistance and TrackerDistance seem almost identical, so TotalDistance can be dropped. 
* SedentaryActiveDistance sometimes shows values > 0. Let's set all values to 0 since it is not supposed to be greater than 0.
* TotalSteps, TrackerDistance, SedentaryMinutes and Calories have relatively high standard deviations, which points to very different user behavior. Let's group users later for better comparison.

In [None]:
#delete the column TotalDistance
daily_activity.drop('TotalDistance', inplace=True, axis=1)

In [None]:
#set all values of the SedentaryActiveDistance column to 0
daily_activity = daily_activity.assign(SedentaryActiveDistance=0)
daily_activity

Let's use the groupby() and mean() function to get some averages for each user:

In [None]:
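#numeric_only=True skips non-numeric columns such as day_of_week_str (needed in pandas 2.0 and later)
g = daily_activity.groupby(['Id']).mean(numeric_only=True).reset_index()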
g.sort_values(by='TotalSteps', inplace=True)
g[['Id','TotalSteps','Calories']]

Let's visualize the average steps per day for each user (based on measured days only).

In [None]:
#group the dataset by ID and then apply mean() function to get the average per user, 
#reset the index so all column names have the same hierarchy
g = daily_activity.groupby(['Id']).mean(numeric_only=True).reset_index()
h = g[['TotalSteps','Calories','Id']].sort_values(['TotalSteps']).reset_index().drop(['index'],axis=1)
order_Id = h['Id'].to_list()
fig, ax1 = plt.subplots(figsize=(12,6))
#change the orientation of the x-axis to 45 degrees due to length of user ID
plt.xticks(rotation=45, horizontalalignment='right')
sns.lineplot(data = h['Calories'], marker='o', ax=ax1, sort=False)
plt.xlabel('User ID', size=14)
plt.ylabel('Calories', size=14)
ax2 = ax1.twinx()
sns.barplot(data = h, x='Id', y='TotalSteps', alpha=0.5, ax=ax2, order=order_Id)
plt.title('Average daily calories (line) & total steps per user (bars)')
plt.rcParams['figure.dpi'] = 200;

Daily average steps as well as daily average calories vary a lot across the population. In addition, we can see that people with similar daily steps show different calorie levels. This might be due to different intensity levels or non-step-based activities such as cycling or swimming. Running for half an hour consumes significantly more calories than walking for the same amount of time. Therefore let's have a closer look at the average breakdown of activity intensities (lightly active, fairly active, very active) per user.

In [None]:
# stacked barcharts with more than 2 values are tricky in seaborn. Therefore I use matplotlib

g = daily_activity.groupby(['Id']).mean(numeric_only=True).reset_index()
df = g[['Id','LightlyActiveMinutes','FairlyActiveMinutes','VeryActiveMinutes']]
  
fig, ax = plt.subplots(figsize=(12, 10))

#I prefer a horizontal bar orientation since it is easier to compare
df.plot(
    x = 'Id',
    kind = 'barh',
    stacked = True,
    mark_right = True, 
    color=['#D3E0F9','#7AA3F1','#3D68CF'],
    width=0.7,
    ax=ax)

plt.xlabel('Minutes', size=14)
plt.ylabel('User ID', size=14)
plt.title('Breakdown of activity intensities per user (averaged)', size=18)
plt.legend(loc='lower right');

The intensities also vary strongly among the population. Some users have almost no very active minutes, which means that they almost never perform sports or activities above 6 METs, such as jogging, swimming, cycling (at faster speeds) or weight-lifting. 


In [None]:
g[g['VeryActiveMinutes']>30]

8 users have more than 30 very active minutes per day on average. 

In [None]:
sns.lmplot(x='TotalSteps',y='Calories',data=g,height=4,aspect=3)
plt.title('Relationship between TotalSteps & Calories');

There is no clear positive correlation between total steps and calories. The regression line, which minimizes the squared vertical distances between the points and the line, has a positive slope, but the points scatter widely around it. Only a few additional datapoints would be needed to change the slope from positive to negative. Let's check the exact correlation of all relevant features via a heatmap.

In [None]:
sns.heatmap(g[['TotalSteps','TrackerDistance','LoggedActivitiesDistance',
               'VeryActiveMinutes','FairlyActiveMinutes','LightlyActiveMinutes','SedentaryMinutes',
               'Calories']].corr(),cmap='Reds',linecolor='white',linewidths=1,annot=True)
plt.title('Correlations of selected features')

As the previous plot suggested, TotalSteps and Calories only have a correlation coefficient of 0.39, which is relatively low. A coefficient of this size indicates only a weak to moderate linear relationship. The relationship between TotalSteps and TrackerDistance is, as expected, almost perfect. The second highest correlation is between VeryActiveMinutes and TrackerDistance (0.71). This stronger correlation might support the thesis that the more a user works out at higher intensities (e.g. running), the higher the TrackerDistance. The third highest correlation is between VeryActiveMinutes and Calories. There is almost no correlation between SedentaryMinutes and Calories. It's interesting which narratives we come up with even though the sample size is extremely small. In reality there is always a high chance that it is not A that leads to B, but a combination of various factors (C, D, ...) that impacts B within a certain range. Therefore let's treat these correlations as preliminary until they are confirmed by a larger dataset.
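
How fragile a correlation is with so few users can be illustrated with a small experiment. The sketch below computes the Pearson coefficient by hand for made-up per-user averages (the numbers are not taken from the dataset) and then adds a single hypothetical user with many steps but few calories.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up per-user averages, purely illustrative
steps = [3000, 5000, 7000, 9000, 11000]
calories = [1900, 2300, 2000, 2600, 2400]

r_all = pearson(steps, calories)
# One additional hypothetical user: many steps, few calories
r_with_outlier = pearson(steps + [16000], calories + [1500])

print(f"without extra user: {r_all:.2f}")
print(f"with extra user:    {r_with_outlier:.2f}")
```

In this toy example, a single additional data point is enough to flip the sign of the coefficient, which is why correlations based on roughly 30 users should be read with care.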


In [None]:
daily_activity['LoggedActivitiesDistance'].value_counts()

Users logged their activities only 32 times out of 853 (verified) observations. In the remaining 821 observations, no activity was logged.

Let's verify the hypothesis if people are more active at the weekend by checking calories and daily steps based on week days.

In [None]:
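#numeric_only=True skips the text column day_of_week_str (needed in pandas 2.0 and later)
daily_activity.groupby(['day_of_week']).mean(numeric_only=True).reset_index().drop(['Id'],axis=1)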

Tuesday and Saturday have the highest average step count as well as calorie consumption. Friday and Sunday have the lowest. Our initial hypothesis is therefore only partly true. However, the difference is only about 10-15%, which is not very pronounced. Let's plot the results.

In [None]:
from matplotlib.gridspec import GridSpec
sns.set(style="whitegrid")

w = daily_activity.groupby(['day_of_week_str']).mean().reset_index()

plt.figure(2, figsize=(20,15))
the_grid = GridSpec(2, 2)

plt.subplot(the_grid[0, 1], title='')
sns.barplot(x='TrackerDistance', y='day_of_week_str', data=w.sort_values(['TrackerDistance']).reset_index(drop=True), palette='Spectral', orient='horizontal')

plt.subplot(the_grid[0, 0], title='')

sns.barplot(x='Calories', y='day_of_week_str', data=w.sort_values(['Calories']).reset_index(drop=True), palette='Spectral', orient='horizontal')

plt.suptitle('Weekly activity breakdown', fontsize=18);

For completeness, let's also look into the time of day when people are most active. I want to approach this problem by counting the calories burned per hour, aggregated for the whole population. 

### Hourly calories dataset

In [None]:
hourly_calories.head(5)

In [None]:
hourly_calories.info()

There are no null values. Let's convert the ActivityHour column into a pandas date time object.

In [None]:
#check how many users have provided at least one data point
hourly_calories['Id'].nunique()

In [None]:
#convert to pandas date time object
hourly_calories['ActivityHour_new']= pd.to_datetime(hourly_calories['ActivityHour'])
hourly_calories.drop(labels=['ActivityHour'], axis=1, inplace = True)
hourly_calories.head(3)

Let's create a new column which assigns the day of the week and then plot two diagrams: average calories per hour during weekdays and during the weekend. 

In [None]:
#extract week day from column ActivityHour_new and create a new column for that
#Monday = 0, Tuesday = 1, ... Saturday = 5, Sunday = 6
hourly_calories['day_of_week'] = pd.to_datetime(hourly_calories['ActivityHour_new']).dt.dayofweek
#weekend
hourly_calories_weekend = hourly_calories[hourly_calories['day_of_week']>=5]
#week day
hourly_calories_weekdays = hourly_calories[hourly_calories['day_of_week']<5]

In [None]:
from matplotlib.gridspec import GridSpec
sns.set(style="whitegrid")

#data
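#numeric_only=True drops the datetime column so that reset_index() can add the hour under the same name
weekday = hourly_calories_weekdays.groupby(hourly_calories_weekdays["ActivityHour_new"].dt.hour).mean(numeric_only=True).reset_index()
weekend = hourly_calories_weekend.groupby(hourly_calories_weekend["ActivityHour_new"].dt.hour).mean(numeric_only=True).reset_index()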

plt.figure(2, figsize=(20,15))
the_grid = GridSpec(2, 2)

# first plot weekdays
plt.subplot(the_grid[0, 1], title='Weekdays')
sns.barplot(x='Calories', y='ActivityHour_new', data=weekday, palette='Spectral', orient='horizontal')

# second plot weekend
plt.subplot(the_grid[0, 0], title='Weekends')
sns.barplot(x='Calories', y='ActivityHour_new', data=weekend, palette='Spectral', orient='horizontal')

# overall title
plt.suptitle('Hourly calorie breakdown', fontsize=18);

On weekends people are most active from 12-3pm. During the week from 5-8pm (which is probably for most people after work).
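
Instead of reading the peak hours off the chart, they can also be extracted directly from the grouped data. A minimal sketch with made-up hourly records (the dataframe `toy` is purely illustrative; in the notebook, `hourly_calories_weekdays` or `hourly_calories_weekend` would take its place):

```python
import pandas as pd

# Made-up hourly records for one hypothetical user over two days
toy = pd.DataFrame({
    "ActivityHour_new": pd.to_datetime([
        "2016-04-12 08:00", "2016-04-12 18:00", "2016-04-12 19:00",
        "2016-04-13 08:00", "2016-04-13 18:00", "2016-04-13 19:00",
    ]),
    "Calories": [70, 120, 150, 80, 110, 160],
})

# Average calories per hour of the day
hourly_mean = toy.groupby(toy["ActivityHour_new"].dt.hour)["Calories"].mean()

# The two hours with the highest average calorie consumption
top_hours = hourly_mean.nlargest(2)
print(top_hours)
```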

### Weight data

Now let's have a look at the second csv file, the weight_info file.


In [None]:
weight_info.head(5)

In [None]:
weight_info.info()

In [None]:
weight_info.describe()

* The date is again in string format. Let's convert it into a pandas date time object.
* There are only 67 entries, compared to 940 entries in the daily activity data. This could again indicate that only few users adjust settings or provide additional input over time (assuming all devices support this feature). Let's check how many unique users have logged their weight.
* The body fat percentage column ('Fat') shows 65 rows of missing data (NaN); only 2 valid data points are available. Let's delete that column.
* The Fitabase data dictionary (https://www.fitabase.com/media/1748/fitabasedatadictionary.pdf) explains IsManualReport: the value True means that the weight was entered manually, and False means that the value was automatically received from a connected scale. Apparently some participants of the survey own more than one Fitbit product or were equipped with one as part of the study. Let's count the values later.
* The LogId is the unique log ID in the FitBit system. Due to privacy concerns, it shouldn't have been listed here unless it has been anonymized. Let's delete this column.
* The maximum BMI is 47.5, which is well above the threshold of 40 for class III (morbid) obesity. Let's check if this outlier fits the other data we have for this user.

In [None]:
weight_info['Id'].unique()

Only 8 out of 33 users have provided any weight related data.

In [None]:
#delete the columns LogId and Fat
weight_info.drop('LogId', inplace=True, axis=1)
weight_info.drop('Fat', inplace=True, axis=1)

In [None]:
#create a new column 'date_new' based on 'Date'
weight_info['date_new'] = weight_info['Date']
#convert the object data type into a Pandas date time object
weight_info['date_new']= pd.to_datetime(weight_info['date_new'])
weight_info.drop(labels=['Date'], axis=1, inplace = True)
weight_info.head()

In [None]:
weight_info_grouped = weight_info.groupby(['Id']).mean()
weight_info_grouped['IsManualReport'].value_counts()

3 of 8 users who logged their weight have a scale that is connected to their FitBit account. Having a scale which is automatically connected is a huge improvement in terms of user experience. The user doesn't need to record the weight manually.

Let's rearrange the dataframe by ID to see the average BMI: 

In [None]:
weight_info.groupby(['Id']).mean().reset_index()


According to the WHO, a BMI of 25 or above is considered overweight and a BMI of 30 or above is considered obese. Based on this, 3 out of the 8 participants who have provided weight data fall into the healthy weight category, 4 are considered overweight and one is considered severely obese.

In [None]:
def BMI_category(x):
    #WHO ranges: below 18.5 underweight, 18.5 to <25 normal, 25 to <30 overweight, 30 and above obese
    if x<18.5:
        return "Underweight"
    elif x>=18.5 and x<25:
        return "Normal weight"
    elif x>=25 and x<30:
        return "Overweight"
    elif x>=30:
        return "Obese"

df = weight_info.groupby(['Id']).mean().reset_index()   
#create a new column by applying the function to each BMI value
df['BMI_category'] = df['BMI'].apply(BMI_category)
df

fig, ax1 = plt.subplots(figsize=(6,3))
sns.countplot(x='BMI_category', data=df, order=['Normal weight', 'Overweight', 'Obese']);
plt.title('Breakdown of BMI');


Let's compare the first recorded weight with the last one to see if there is any weight loss/gain.

In [None]:
#create list of unique IDs
weight_id = weight_info['Id'].unique()
#for loop to append earliest_date, latest_date and weight_dif list
earliest_date = []
latest_date = []
weight_dif = []
for x in weight_id:
    #use min() and max() function to find earliest/latest recording
    #alternative: use sort_values function 
    a = weight_info[weight_info['Id']==x]
    earliest_date.append(a['date_new'].min())
    latest_date.append(a['date_new'].max())
    #sort values by date in order to pick first and last weight
    b = a.sort_values('date_new')
    #calculate difference between earliest recorded and last recorded weight
    dif = b['WeightKg'].tolist()[-1]-b['WeightKg'].tolist()[0]
    weight_dif.append(dif)

#check the number of weight records for each user 
number_of_records = []
for x in weight_id:
    individual_records = weight_info[weight_info['Id']==x]['Id'].tolist()
    c = len(individual_records)
    number_of_records.append(c)

#convert lists into dataframe
a_dict = {"Id":weight_id, "earliest_date":earliest_date, "latest_date":latest_date, "weight_difference": weight_dif, "number_of_records":number_of_records}
new_df = pd.DataFrame(a_dict)
difference = new_df['latest_date']-new_df['earliest_date']
new_df.insert(3,'time_difference', difference, True)
new_df


* Weight data of only 8 users out of 33 is available (due to unknown reasons such as not supported by device, unwillingness of user, etc.)
* For 2 out of these 8 users there is only one record available, therefore no weight difference can be measured
* In order to measure any weight loss or gain, periods shorter than 7 days don't make sense. Only 5 users have provided weight records extending over a period of more than 7 days (difference between the earliest and latest weight record)
* Furthermore, the latest weight record is mostly (much) earlier than the end of the observation period, therefore we don't know the user behavior after that (weight loss could follow weight gain)
* Conclusion: Statistically, the number of users is too small to be representative of a larger population, and the quality of the data is not sufficient to draw any conclusions in terms of weight loss/gain. Therefore correlations with other features will be calculated only for the sake of practicing data analysis. Results will not be considered in the summary. The only interesting observation is that 3 of 8 users own a scale that is connected to their FitBit account. Having a scale which is automatically connected would be a huge improvement in terms of user experience. The user doesn't need to record the weight manually and only has to step onto the scale while the phone is nearby. This hypothesis could be checked by interviewing some of those users.

Let's check further fitness data for the user 1927972279 with BMI of 47.5

In [None]:
weight_info[weight_info['Id']==1927972279]['WeightKg']

In [None]:
daily_activity[daily_activity['Id']==1927972279]

Daily steps show that this user doesn't move a lot which is probably related to the high body weight of 133kg (at a low body height if you calculate the height based on the BMI formula backwards). For this specific user the entries where TotalSteps equal 0 which we deleted before might make sense. Let's revisit that data and compare this to the weight information we have.
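
The backwards calculation mentioned above follows from the BMI formula, BMI = weight / height², solved for height. A short sketch using the two values quoted in the text:

```python
from math import sqrt

weight_kg = 133   # body weight quoted above
bmi = 47.5        # BMI quoted above

# BMI = weight / height**2  =>  height = sqrt(weight / BMI)
height_m = sqrt(weight_kg / bmi)
print(f"Implied height: {height_m:.2f} m")
```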

In [None]:
#reload the daily_activity csv file since some rows were deleted before
daily_activity_raw = pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
daily_activity_raw[['Id','TotalSteps','SedentaryMinutes','Calories']].groupby(['Id']).mean() 

Only one user (1927972279) has an average of daily steps less than 1000. Therefore let's stick to the previous logic of deleting all values where daily steps are less than 100.

Let's check if there is a relationship between BMI and calories or sedentary minutes. The number of users with weight data is only 8, but let's do the comparison for the sake of data analytics, which could ideally be applied to a larger dataset in the future.
As a first step we have to merge two dataframes.

In [None]:
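df1 = daily_activity.groupby(['Id']).mean(numeric_only=True).reset_index()
df2 = weight_info.groupby(['Id']).mean(numeric_only=True).reset_index()
#merge on Id so that each BMI value stays with its user
#(assigning the column of a merged dataframe would align by row index, not by Id)
#the inner join keeps only users who have weight data, so no NaN rows remain
df1 = df1.merge(df2[['Id','BMI']], on='Id', how='inner')
df1[['Id','Calories','BMI']].sort_values('BMI')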

In [None]:
df1[['Id','Calories','BMI']].corr()

The correlation between calories and BMI is really weak for this dataset. As mentioned before the sample data is by far not sufficient. Therefore these results will not be further considered.
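
When attaching a per-user value such as the BMI to another dataframe, it matters that rows are matched by Id. Assigning a column taken from a merged dataframe aligns on the row index instead, which can silently pair a value with the wrong user. A minimal sketch with made-up values (the Ids and numbers are purely illustrative):

```python
import pandas as pd

# Made-up per-user averages
activity = pd.DataFrame({"Id": [10, 20, 30, 40],
                         "Calories": [1800, 2200, 2500, 2000]})
weight = pd.DataFrame({"Id": [30, 40], "BMI": [27.0, 22.5]})

# An inner merge matches rows by Id and keeps only users present in both frames
combined = activity.merge(weight[["Id", "BMI"]], on="Id", how="inner")
print(combined)
```

Here only the users 30 and 40 remain, each with their own BMI next to their own calorie average.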

### Sleep dataset

Let's move ahead and look into sleep related data.

In [None]:
sleep_day.head(5)

In [None]:
sleep_day.info()

In [None]:
sleep_day['SleepDay_new']= pd.to_datetime(sleep_day['SleepDay'])
sleep_day.drop(labels=['SleepDay'], axis=1, inplace = True)
sleep_day

Below you find the column explanations according to the Fitabase data dictionary (https://www.fitabase.com/media/1748/fitabasedatadictionary.pdf):
* TotalSleepRecords: Number of recorded sleep records for that day 
* TotalTimeInBed: Total minutes spent in bed, including awake, light, deep, and REM sleep, during a defined sleep record
* TotalMinutesAsleep: Total number of minutes classified as being "asleep" (sum total of light, deep, and REM sleep)



In [None]:
sleep_day.describe()

In [None]:
sleep_day['Id'].nunique()

* The dataset doesn't contain any null values 
* The dataset consists of 413 entries out of 1023 maximum possible entries (40.4%)
* Only 24 users show any sleep data (some people might not have worn the device at night due to charging or other reasons). 
* The maximum TotalMinutesAsleep value is 796 minutes or 13.3 hours. That user probably slept several times that day, which can be verified later. 
* The minimum TotalMinutesAsleep value is 58 minutes.


In [None]:
sleep_day[sleep_day['TotalMinutesAsleep'] > 600]

There are 18 records where the user slept longer than 10 hours. 50% of the users slept several times a day (TotalSleepRecords > 1).

Let's calculate the time users are awake while in bed:

In [None]:
sleep_day['wake_time']= sleep_day['TotalTimeInBed']-sleep_day['TotalMinutesAsleep']
sleep_day.groupby('Id').mean()

FitBit calculates time asleep based on a combination of movement and heart-rate patterns (https://help.fitbit.com/articles/en_US/Help_article/2163.htm). If we compare time in bed with the actual time asleep, the gap might indicate the user is:
* restless and cannot fall asleep
* is still awake (e.g. reading or using the phone)
* woke-up (several times) during sleep
* doesn't get up immediately the next morning. 

It's difficult to attribute the exact reason for this. In general the gap between both values varies quite a lot. 

Let's merge the sleep_day dataframe with the daily_activity dataframe and check how the activity intensity impacts sleep duration. 


In [None]:
# average all sleep metrics per user (numeric_only=True skips the date column)
df1 = sleep_day.groupby('Id').mean(numeric_only=True).reset_index()
# group users into 3 categories by time in bed (in minutes): <7 hours, 7-8 hours, >=8 hours
def assign_ranges(x):
    if x < 420:
        return "less than 7 hours"
    elif x < 480:
        return "7-8 hours"
    else:
        return "more than 8 hours"
# create a new column based on the ranges (the function can be passed to apply directly)
df1['TotalTimeInBed_ranges'] = df1['TotalTimeInBed'].apply(assign_ranges)
sns.countplot(x="TotalTimeInBed_ranges", data=df1, order=["less than 7 hours", "7-8 hours", "more than 8 hours"]);
plt.title('Total time in bed');
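
As a possible shortcut for the `assign_ranges` function, `pd.cut` can bin a numeric column in one call. The sketch below uses made-up toy values, not the real `df1`, and assumes the same bin edges of 420 and 480 minutes.

```python
import pandas as pd

# Toy values in minutes (made up for illustration, not taken from the dataset).
time_in_bed = pd.Series([419, 420, 479, 480, 600])

labels = ["less than 7 hours", "7-8 hours", "more than 8 hours"]

# right=False makes each bin include its left edge and exclude its right edge,
# i.e. [0, 420), [420, 480), [480, inf), which matches assign_ranges.
ranges = pd.cut(time_in_bed, bins=[0, 420, 480, float("inf")],
                right=False, labels=labels)
print(ranges.tolist())
```

The result is an ordered categorical, so sorting follows the logical order of the ranges rather than alphabetical order. Unlike `assign_ranges`, negative inputs would become NaN here, which should not matter for minutes in bed.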

Let's merge the data with the daily activity dataframe and check how the intensities affect the sleep behavior.

In [None]:
# average all activity metrics per user (numeric_only=True skips the date column)
df2 = daily_activity.groupby('Id').mean(numeric_only=True).reset_index()

# merge both dataframes based on Id column, only Ids common in both dataframes remain
merged_df = pd.merge(df2, df1, on="Id")
merged_df.head(5)
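
`pd.merge` performs an inner join by default, so users without sleep data silently drop out of `merged_df`. Here is a minimal sketch, with made-up toy data and hypothetical Ids, of how an outer join with `indicator=True` could show which users are lost:

```python
import pandas as pd

# Toy per-user averages (made-up values; the Ids are hypothetical).
activity = pd.DataFrame({"Id": [1, 2, 3, 4],
                         "TotalSteps": [8000, 3500, 12000, 6000]})
sleep = pd.DataFrame({"Id": [1, 3],
                      "TotalMinutesAsleep": [410, 365]})

# how="outer" keeps every Id; indicator=True adds a "_merge" column that says
# whether a row was found in both frames or in only one of them.
check = pd.merge(activity, sleep, on="Id", how="outer", indicator=True)

# Ids that exist in the activity data but have no sleep data.
dropped_ids = check.loc[check["_merge"] == "left_only", "Id"].tolist()
print(dropped_ids)
```

In the real data this check would be expected to list the users who have activity data but no sleep records.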

In [None]:
# group users into 3 categories by average very active minutes per day
def assign_ranges2(x):
    if x < 15:
        return "less than 15 min"
    elif x < 30:
        return "15-30 min"
    else:
        return "more than 30 min"
# create a new column based on the ranges
merged_df['VeryActiveMinutes_ranges'] = merged_df['VeryActiveMinutes'].apply(assign_ranges2)
sns.countplot(x='TotalTimeInBed_ranges', data=merged_df, hue='VeryActiveMinutes_ranges', order=['less than 7 hours', '7-8 hours', 'more than 8 hours'])
plt.title('Relationship between VeryActiveMinutes and Total Time in Bed');

In [None]:
new = merged_df[['TotalSteps','Calories','TotalMinutesAsleep','TotalTimeInBed','wake_time','VeryActiveMinutes_ranges']]
new.groupby('VeryActiveMinutes_ranges').mean()

People who are very active for more than 30 minutes per day burn the most calories on average. On average, this group sleeps only about 6 hours a day and has the shortest awake time in bed. For the other two groups (very active for less than 15 min and for 15-30 min), total minutes asleep and awake time are nearly the same.
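
Since only 24 users have sleep data, each of the three groups contains few users, so it may help to report the group size next to each mean. This is a minimal sketch using named aggregation on made-up toy values; the output column names `users` and `mean_minutes_asleep` are only illustrative.

```python
import pandas as pd

# Toy per-user averages (made-up values for illustration only).
toy = pd.DataFrame({
    "VeryActiveMinutes_ranges": ["less than 15 min", "less than 15 min",
                                 "15-30 min", "more than 30 min"],
    "TotalMinutesAsleep": [430, 410, 425, 360],
})

# Named aggregation: one output column per (input column, function) pair.
summary = toy.groupby("VeryActiveMinutes_ranges").agg(
    users=("TotalMinutesAsleep", "count"),
    mean_minutes_asleep=("TotalMinutesAsleep", "mean"),
)
print(summary)
```

A mean that rests on only a handful of users is easily shifted by a single person, which is worth keeping in mind when reading the table above.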

## Summary of main findings (to be treated with caution due to 3rd party data)

* The data shows a huge variation among users in terms of the measured features: daily steps, calories, intensities, etc.
* For 15.6% of the observed time period there is no data available (among regular users this share could be even higher, since the participants were paid for their data via Amazon Mechanical Turk, which might have incentivized them to wear the device more often)
* Only 24 out of 33 users provided any sleep-related data (the device wasn't worn at night or didn't support this feature), which covers only 40.4% of all possible observations
* Users do not log their activity for the majority of the time
* Only 8 out of 33 users provided any weight-related data. 3 of these 8 probably have a scale that automatically connects to the FitBit device.
* Tuesday and Saturday have the highest average step count and calories burned. Friday and Sunday have the lowest.
* On weekends people are most active from 12-3pm, during the week from 5-8pm (which for most people is probably after work).
* There is only a weak correlation between daily steps and calories, but a high correlation between VeryActiveMinutes and calories as well as between VeryActiveMinutes and tracker distance
* Users who were very active for more than 30 minutes per day burned the most calories, but also slept the least (approx. 6 hours) and had the shortest awake time in bed
* Body fat was only measured correctly twice (if not caused by data manipulation), and sedentarydistance was sometimes incorrect (>0)




## Recommendations 

The following findings and recommendations should be verified with a new dataset that offers more transparency about the data collection method.

* Offer users more customization options. The data shows huge variation among users (average steps, workout intensities, weight, etc.). Different user groups require different motivations, programs, etc. A more targeted approach might improve customer retention.
* Improved battery life or charging recommendations could help people wear the device more often.
* Track activities automatically instead of relying on manual input. The FitBit data suggests that a scale which connects to the device automatically might increase the likelihood of receiving weight data, which allows more detailed analysis and recommendations for the user. Today's algorithms can detect activities such as running, swimming, or cycling automatically.
* Instead of manual input, additional user information could be detected automatically by face recognition in the Bellabeat app, provided the user has given consent (e.g. detecting gender, age, and BMI via the camera, similar to the Anura app).
* Reminders on the phone, participation in (social) competitions, or goal-setting could help increase user engagement.
* Due to the very small screen of the wearable, users have difficulty navigating the settings, so user experience is key: automation and accessibility for users with impaired vision should be ensured (bigger fonts, simplified menu, enlarged icons, etc.).
* The data hints at some technical issues (low number of steps on certain days, sedentarydistance, body fat, etc.). It is important that the device is reliable.

Below are a few findings from other studies, which are partly supported by findings from the FitBit dataset:
* Many users are not satisfied with the usability
* The longer the device is used, the more satisfied the users are (especially health professionals)
* Users show low brand loyalty
* Strong use cases, such as people with health problems (diabetes, high blood pressure) or athletes, strengthen usage of the device
* Intelligent recognition functions are required (e.g. algorithms which can track activity automatically)
* Comfort, charging burdens and functions are key areas to focus on
* Most devices offer similar functions, so new technologies could provide a competitive advantage through new features

Sources:
* Liang J, Xian D, Liu X, Fu J, Zhang X, Tang B, Lei J. Usability Study of Mainstream Wearable Fitness Devices: Feature Analysis and System Usability Scale Evaluation. JMIR Mhealth Uhealth. 2018;6(11):e11066. DOI: 10.2196/11066. URL: https://mhealth.jmir.org/2018/11/e11066
* Keogh A, Dorn JF, Walsh L, Calvo F, Caulfield B. Comparing the Usability and Acceptability of Wearable Sensors Among Older Irish Adults in a Real-World Context: Observational Study. JMIR Mhealth Uhealth. 2020;8(4):e15704. DOI: 10.2196/15704. PMID: 32310149; PMCID: PMC7199137.

Thanks for reading through my project. Looking forward to your feedback!