# Bellabeat Data Analysis
## Introduction
Bellabeat is a high-tech manufacturer of health-focused products for women. It is a successful small company with the potential to become a larger player in the global smart device market. Bellabeat believes that analyzing smart
device fitness data could help unlock new growth opportunities for the company. Its products include:
<ul>
<li>Bellabeat app</li>
<li>Leaf: Bellabeat’s classic wellness tracker, which connects to the Bellabeat app</li>
<li>Time: a wellness watch that provides insights into your daily wellness</li>
<li>Spring: a water bottle that connects to the Bellabeat app to track your hydration levels</li>
</ul>
This analysis follows five steps:
<ol>
<li>Define business task</li>
<li>Data Preparation </li>
<li>Data Processing</li>
<li>Data Analysis</li>
<li>Results</li>
</ol>

## Business Task
<p>
Bellabeat wants to grow the business for one of its products. By gaining insight into how consumers use non-Bellabeat smart devices, we can apply those findings to a Bellabeat product and recommend marketing strategies for the company.</p>

<b>Key stakeholders</b>
<ul>
<li>Urška Sršen: Bellabeat’s co-founder and Chief Creative Officer</li>
<li>Sando Mur: Mathematician and Bellabeat’s co-founder</li>
<li>Bellabeat marketing analytics team</li>
</ul>

## Data Preparation
I downloaded the dataset from Kaggle, unzipped it, and stored it in a OneDrive folder. I then loaded the data into Python for cleaning and analysis.

<i> Import the libraries </i>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

<i>Import the datasets</i>

In [None]:
daily_activities=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv')
daily_calories=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv')
daily_intensities=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv')
daily_steps=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv')
hourly_steps=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv')
hourly_intensities=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/hourlyIntensities_merged.csv')
sleep_day=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv')
weight_index=pd.read_csv('../input/fitbit/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv')

<i>The downloaded data folder contains 18 CSV files with daily, hourly, and minute-level data on activity, intensity, calories, and more.</i>

<i>In this study, I focus only on the daily and hourly data to simplify the analysis.</i>

## Data Processing

<i>Explore the datasets</i>

In [None]:
daily_activities.info()
daily_calories.info()
daily_intensities.info()
daily_steps.info()

<i>The output shows that every variable in those files also appears in daily_activities, so I use only daily_activities for the analysis.</i>
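The claim that one table contains the others can also be checked in code rather than by reading the `info()` output. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, the two frames are toy stand-ins for the real tables, and the optional `rename` mapping is an assumption for the case where the same field has a different name in each file.

```python
import pandas as pd

def missing_columns(small, big, rename=None):
    """Return the columns of `small` that are absent from `big` (hypothetical helper)."""
    renamed = small.rename(columns=rename or {})
    return sorted(set(renamed.columns) - set(big.columns))

# Toy frames standing in for the real tables (made-up values)
big = pd.DataFrame({
    "Id": [1, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016"],
    "TotalSteps": [100, 200],
    "Calories": [1800, 1900],
})
small = pd.DataFrame({
    "Id": [1, 2],
    "ActivityDay": ["4/12/2016", "4/13/2016"],
    "Calories": [1800, 1900],
})

# Without renaming, the differently named date column is reported as missing
print(missing_columns(small, big))
# After mapping the date column to the same name, nothing is missing
print(missing_columns(small, big, rename={"ActivityDay": "ActivityDate"}))
```

An empty list means every column of the smaller table is already available in the larger one. It does not confirm that the values match.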

In [None]:
sleep_day.info()
weight_index.info()

In [None]:
sleep_day.head()

<i>Let’s check the data type and the statistical summary of each variable</i>

In [None]:
weight_index.head()

In [None]:
len(weight_index.Id.unique())

In [None]:
len(daily_activities.Id.unique())

In [None]:
len(sleep_day.Id.unique())

<i>Not all users track their weight and sleep time, so any analysis that uses the weight_index data must be interpreted carefully because the sample may be biased.</i>
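To put a number on this, the user counts can be expressed as a share of all users in the activity table. This is a minimal sketch on toy data. `tracking_coverage` is a hypothetical helper, and the only assumption is that each table has an `Id` column, as the tables above do.

```python
import pandas as pd

def tracking_coverage(reference, tables):
    """Share of users in `reference` that appear in each table (hypothetical helper)."""
    total = reference["Id"].nunique()
    return {name: df["Id"].nunique() / total for name, df in tables.items()}

# Toy stand-ins for daily_activities, sleep_day and weight_index (made-up Ids)
activity = pd.DataFrame({"Id": [1, 1, 2, 3, 4]})
sleep = pd.DataFrame({"Id": [1, 2, 2, 3]})
weight = pd.DataFrame({"Id": [1, 1]})

coverage = tracking_coverage(activity, {"sleep": sleep, "weight": weight})
print(coverage)
```

A low share for a table is a sign that conclusions drawn from it describe only the subset of users who chose to log that measurement.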

In [None]:
sleep_day.columns=['Id', 'Date', 'TotalSleepRecords', 'TotalMinutesAsleep',
       'TotalTimeInBed']
daily_activities.columns=['Id', 'Date', 'TotalSteps', 'TotalDistance', 'TrackerDistance',
       'LoggedActivitiesDistance', 'VeryActiveDistance',
       'ModeratelyActiveDistance', 'LightActiveDistance',
       'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes',
       'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories']

<i>Rename the columns so that both tables use the same name, Date, for the date column</i>

In [None]:
# Change all date columns in three tables into same format
sleep_day.Date=pd.to_datetime(sleep_day['Date'], format="%m/%d/%Y %I:%M:%S %p").dt.strftime("%m/%d/%Y")
daily_activities.Date=pd.to_datetime(daily_activities['Date'], format="%m/%d/%Y").dt.strftime("%m/%d/%Y")
weight_index.Date=pd.to_datetime(weight_index['Date'], format="%m/%d/%Y %I:%M:%S %p").dt.strftime("%m/%d/%Y")

In [None]:
daily_activities.describe()

## Data Analysis
### Users' daily activity

In [None]:
# Derive the weekday name and its index (Monday=0) from the date string
dates=pd.to_datetime(daily_activities["Date"], format="%m/%d/%Y")
daily_activities["DayofWeek"]=dates.dt.day_name()
daily_activities["IndexofWeek"]=dates.dt.dayofweek
# Average the numeric columns per weekday; numeric_only skips the string Date column
gb_daily_activities=daily_activities.groupby("DayofWeek").mean(numeric_only=True).reset_index()
# Keep the sorted result so the bars run from Monday to Sunday
gb_daily_activities=gb_daily_activities.sort_values("IndexofWeek")
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10,8))
ax=sns.barplot(x="DayofWeek", y="TotalSteps",data=gb_daily_activities, palette="Accent_r")
ax.set_ylabel("Average Steps",fontsize=20)
ax.set_xlabel("Day of Week",fontsize=20)
ax.set_title("Average Steps During Week",fontsize=25)
plt.show()

<i>The chart shows that users take the most steps on Saturday and Tuesday, although the differences between days are small.</i>
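How small the differences are can be summarized with a single number. The sketch below uses toy data and a hypothetical helper, `weekday_step_spread`, which reports the gap between the busiest and quietest weekday relative to the mean of the weekday averages. The choice of this particular measure is mine and is only one of several reasonable options.

```python
import pandas as pd

def weekday_step_spread(df):
    """Return the weekday averages and their relative spread (hypothetical helper)."""
    dates = pd.to_datetime(df["Date"], format="%m/%d/%Y")
    means = df.groupby(dates.dt.day_name())["TotalSteps"].mean()
    # (largest weekday average - smallest weekday average) / mean of the averages
    spread = (means.max() - means.min()) / means.mean()
    return means, spread

# Toy data with made-up step counts: two Tuesdays, one Wednesday, one Saturday
toy = pd.DataFrame({
    "Date": ["04/12/2016", "04/13/2016", "04/16/2016", "04/19/2016"],
    "TotalSteps": [9000, 7000, 9500, 8000],
})

means, spread = weekday_step_spread(toy)
print(means)
print(round(spread, 3))
```

A spread close to zero would support the reading that the weekday has little influence on step counts.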

### Users' hourly activity

In [None]:
hourly_steps.head()

In [None]:
# Extract the hour of day (00-23) from the timestamp string
hourly_steps["Hour"]=pd.to_datetime(hourly_steps["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p").dt.strftime("%H")
# Average steps per hour; numeric_only skips the string ActivityHour column
gb_hourly_steps=hourly_steps.groupby('Hour').mean(numeric_only=True).reset_index()
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10,8))
hourly_chart=sns.barplot(data=gb_hourly_steps,x="Hour",y="StepTotal")
hourly_chart.set_ylabel("Average Steps",fontsize=20)
hourly_chart.set_xlabel("Hour",fontsize=20)
hourly_chart.set_title("Hourly distribution of user steps",fontsize=25)
plt.show()

<i>Average steps peak between 18:00 and 19:00, presumably because that is when users leave work or school and have free time to exercise.</i>
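The peak hour can also be read directly from the data instead of from the chart. This is a minimal sketch on toy data. `peak_hour` is a hypothetical helper, and it assumes the same `ActivityHour` and `StepTotal` columns and timestamp format used above.

```python
import pandas as pd

def peak_hour(df):
    """Return the hour of day (0-23) with the highest average step count (hypothetical helper)."""
    hours = pd.to_datetime(df["ActivityHour"], format="%m/%d/%Y %I:%M:%S %p").dt.hour
    return int(df.groupby(hours)["StepTotal"].mean().idxmax())

# Toy data with made-up step counts for two hours on two days
toy = pd.DataFrame({
    "ActivityHour": [
        "4/12/2016 9:00:00 AM", "4/12/2016 6:00:00 PM",
        "4/13/2016 9:00:00 AM", "4/13/2016 6:00:00 PM",
    ],
    "StepTotal": [300, 600, 400, 500],
})

print(peak_hour(toy))
```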

### Sleep quality and Intensity

In [None]:
# Minutes spent in bed but not asleep, used as a rough proxy for the time taken to fall asleep
sleep_day["TimeTakeToSleep"]=sleep_day["TotalTimeInBed"]-sleep_day["TotalMinutesAsleep"]
hourly_intensities.columns=['Id', 'Date', 'TotalIntensity', 'AverageIntensity']
hourly_intensities.Date=pd.to_datetime(hourly_intensities['Date'], format="%m/%d/%Y %I:%M:%S %p").dt.strftime("%m/%d/%Y")
# Sum the hourly intensities into one row per user and day
gb_hourly_intensities=hourly_intensities.groupby(["Date","Id"]).sum().reset_index()
sleep_and_intensities=pd.merge(sleep_day,gb_hourly_intensities,on=["Date","Id"],how="inner")
f, axes= plt.subplots(1,3, figsize=(12,6))
k1=sns.regplot(data=sleep_and_intensities, x='TimeTakeToSleep', y='TotalIntensity',ax=axes[1])
k2=sns.regplot(data=sleep_and_intensities, x='TotalMinutesAsleep', y='TotalIntensity',ax=axes[0])
k3=sns.regplot(data=sleep_and_intensities, x='TotalTimeInBed', y='TotalIntensity',ax=axes[2])
plt.tight_layout()
plt.show()

<i>The plots do not show a clear relationship between activity intensity and sleep quality.</i>
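A correlation coefficient gives a more precise reading than the regression plots alone. The sketch below uses toy data with made-up values that are deliberately perfectly linear, so it says nothing about the real dataset. It assumes a merged table with the same column names as `sleep_and_intensities`, and it relies on pandas `DataFrame.corrwith`, which computes the Pearson correlation by default.

```python
import pandas as pd

# Toy stand-in for sleep_and_intensities (made-up, perfectly linear values)
toy = pd.DataFrame({
    "TotalIntensity": [100, 200, 300, 400],
    "TotalMinutesAsleep": [300, 350, 400, 450],
    "TotalTimeInBed": [330, 370, 410, 450],
})
toy["TimeTakeToSleep"] = toy["TotalTimeInBed"] - toy["TotalMinutesAsleep"]

sleep_columns = ["TotalMinutesAsleep", "TotalTimeInBed", "TimeTakeToSleep"]
# Pearson correlation of each sleep measure with the daily total intensity
correlations = toy[sleep_columns].corrwith(toy["TotalIntensity"])
print(correlations)
```

On the real data, values near zero would be consistent with the reading above. Pearson correlation only captures linear association, so a weak value does not rule out other kinds of relationship.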

### Intensity of exercise activity

In [None]:
labels = 'Very Active', 'Moderately Active', 'Light Active', 'Sedentary Active'
distance_columns = ['VeryActiveDistance', 'ModeratelyActiveDistance',
                    'LightActiveDistance', 'SedentaryActiveDistance']
# Share of the total distance covered at each intensity level
sizes = daily_activities[distance_columns].sum() / daily_activities['TotalDistance'].sum()
explode = (0, 0, 0.03, 0)
textprops = {"fontsize":14}

fig1, ax1 = plt.subplots(figsize=(6,6))
a1=ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90,textprops =textprops)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("Intensity of Activity",{'fontsize':30},y=1.1)
plt.show()

<i>Light activity accounts for the largest share of distance, which makes it the most common intensity level during exercise.</i>
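The dominant level can be confirmed numerically as well. This is a minimal sketch on toy data with made-up distances. It assumes the same distance column names as `daily_activities` and simply picks the column with the largest share of the total distance.

```python
import pandas as pd

# Toy stand-in for daily_activities (made-up distances)
toy = pd.DataFrame({
    "VeryActiveDistance": [1.0, 2.0],
    "ModeratelyActiveDistance": [0.5, 0.5],
    "LightActiveDistance": [3.0, 4.0],
    "SedentaryActiveDistance": [0.0, 0.0],
    "TotalDistance": [4.5, 6.5],
})

distance_columns = ["VeryActiveDistance", "ModeratelyActiveDistance",
                    "LightActiveDistance", "SedentaryActiveDistance"]
# Share of the total distance covered at each intensity level
shares = toy[distance_columns].sum() / toy["TotalDistance"].sum()
dominant = shares.idxmax()
print(shares.round(3))
print(dominant)
```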

## Results

* Total steps and sleep quality do not show a clear correlation, although they broadly move in the same direction.
* Users favor light activities. Bellabeat should add more features focused on this type of activity, for example by offering more light exercises in the Bellabeat app for users to follow.
* Users do not often use the app to track their weight and sleep quality. I think Bellabeat should send notifications to remind them. In addition, the Bellabeat app could use each user's BMI to build a personalized health plan.