# FitBit Bellabeat Case Study

#### The following notebook shows the analysis process of the FitBit Bellabeat Google Data Analytics Certification capstone case study. 

#### Note: although Python wasn't taught in the certification, I decided to use this language in the case study analysis.

## Scenario

#### As per course briefing:

#### You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products 
#### for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market.
#### Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new 
#### growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain 
#### insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company.
#### You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

## Phase 1 - Ask

### Business Task

#### Analyse smart device usage behaviour and identify users' current trends.
#### Understand how Bellabeat's marketing strategy can be improved based on such trends, thus driving engagement and increasing Bellabeat's revenue and user base. 

## Phase 2 - Prepare

#### The data used in the analysis was retrieved from the FitBit Fitness Tracker Data on Kaggle: https://www.kaggle.com/arashnic/fitbit .
#### It is also publicly available on the Zenodo website: https://zenodo.org/record/53894#.YMoUpnVKiP9 .

#### The data is organised in different files, which contain health app user behaviour. Some files contain the wide version and some the narrow version of the same data.

#### It was generated through Amazon Mechanical Turk and contains records provided by 30 people. This is a small sample, so it may not fully reflect the entire population.
#### The data can be considered reliable, original, current (it dates back to 2016), comprehensive (it contains all the information needed to solve the business task) and cited (it was uploaded to Kaggle by an experienced user).

#### The data is licensed, and privacy is protected because the records are anonymised. In terms of accessibility, the data is open and available on the Zenodo website, as mentioned above.

#### In terms of integrity, we can assume that data is accurate. It is complete in terms of included fields. It is also consistent and trustworthy.

## Phase 3 - Process

### Data manipulation through SQL

#### After downloading the files, I uploaded them to BigQuery through Google Cloud (it was not possible to upload them directly to BigQuery because of their size). This gave me a dataset made up of several tables, one for each uploaded file.

#### All tables are related to each other through the Id and the date/time columns. The latter have different names across tables (e.g. ActivityDate). In some cases the date columns used to join tables have a timestamp data type, so they needed to be cast to date.

#### After analysing the contents of all the tables, I decided to use the daily_activity and sleep_day_merged tables, since they contain most of the metrics needed for the analysis.

#### I joined the two tables with SQL and downloaded the view obtained as a csv file by using the following code:

#### SELECT da.Id,
#### da.ActivityDate AS date,
#### CAST(da.TotalSteps AS INTEGER) AS steps,
#### da.Calories AS calories,
#### sl.TotalMinutesAsleep,
#### sl.TotalTimeInBed,
#### da.VeryActiveMinutes,
#### da.FairlyActiveMinutes,
#### da.LightlyActiveMinutes,
#### da.SedentaryMinutes,
#### ROUND(da.VeryActiveDistance,2) AS very_active_distance,
#### ROUND(da.ModeratelyActiveDistance,2) AS moderately_active_distance,
#### ROUND(da.LightActiveDistance,2) AS lightly_active_distance,
#### da.SedentaryActiveDistance
#### FROM `case-studies-314618.fitbit_fitness_tracker_data.daily_activity` da INNER JOIN `case-studies-314618.fitbit_fitness_tracker_data.sleep_day_merged` sl 
#### ON da.Id = sl.Id AND sl.SleepDay = da.ActivityDate;

#### I then loaded the csv file into this notebook to further manipulate and analyse the data with Python.
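
#### For readers who prefer to stay in Python, the join described above can be sketched with pandas. This is only an illustration, not the code used in this case study: the two small dataframes and their values are invented, and the date formats are an assumption about how the raw csv files store ActivityDate and SleepDay.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values invented for illustration).
daily_activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016"],
    "TotalSteps": [13162, 10735, 8000],
})
sleep_day = pd.DataFrame({
    "Id": [1, 2],
    "SleepDay": ["4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"],
    "TotalMinutesAsleep": [327, 400],
})

# Parse both columns and drop the time part so that the join keys are comparable dates.
# The format strings are assumptions about the raw files.
daily_activity["date"] = pd.to_datetime(
    daily_activity["ActivityDate"], format="%m/%d/%Y"
).dt.normalize()
sleep_day["date"] = pd.to_datetime(
    sleep_day["SleepDay"], format="%m/%d/%Y %I:%M:%S %p"
).dt.normalize()

# Inner join on user and day, mirroring the INNER JOIN ... ON clause of the SQL query.
joined = daily_activity.merge(sleep_day, on=["Id", "date"], how="inner")
```

#### As in the SQL version, only the days for which a user has both an activity record and a sleep record survive the inner join.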

### Data manipulation through Python

In [None]:
#### Importing Pandas, Matplotlib, Seaborn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
#### Loading the csv file containing the relevant data.

df = pd.read_csv('../input/fitbit-bellabeat-case-study-selected-fields/case_study_2.csv')
df.describe()

In [None]:
#### Checking if there are any missing values across columns

df.isnull().any()

In [None]:
#### Checking data format across columns

df.dtypes

In [None]:
#### Converting date column to datetime

df['date'] = pd.to_datetime(df['date'])

In [None]:
#### Plotting distributions to check for outliers. Although there are some outliers, distributions tend towards normality, so I have decided not to delete any outliers.
#### Note: to check other fields, simply replace the column name in the code.

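# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df['calories'], kde=True)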
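
#### The visual check can be complemented with a simple numeric rule. The sketch below counts values falling outside 1.5 times the interquartile range, a common convention rather than the method used in this case study. The function name count_iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical values, invented purely to demonstrate the function.
sample = pd.Series([1800, 1900, 2000, 2100, 2200, 2300, 5000])
n_outliers = count_iqr_outliers(sample)
```

#### With the dataframe used in this notebook, the same function could be applied to every numeric column at once, for example with df.select_dtypes('number').apply(count_iqr_outliers).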

In [None]:
#### Counting duplicates

df.duplicated().sum()

In [None]:
#### Finding duplicates

df[df.duplicated(keep=False)]

In [None]:
#### Dropping duplicates

df = df.drop_duplicates()
df = df.reset_index(drop=True)

## Phase 4 - Analyse 
### Organising data for analysis

In [None]:
#### Adding 'in bed not sleeping' column

df['in_bed_not_sleeping'] = df['TotalTimeInBed'] - df['TotalMinutesAsleep']


In [None]:
#### Adding day of the week number and day of the week name columns

df['day_number'] = df['date'].dt.dayofweek
df['day_of_week'] = df['date'].dt.day_name()
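
#### As a quick sanity check on the two accessors used above, pandas numbers the days from Monday (0) to Sunday (6). The dates below are chosen only for illustration.

```python
import pandas as pd

# Three illustrative dates: a Monday, a Tuesday and a Sunday in April 2016.
dates = pd.Series(pd.to_datetime(["2016-04-11", "2016-04-12", "2016-04-17"]))

day_numbers = dates.dt.dayofweek.tolist()   # Monday is 0, Sunday is 6
day_names = dates.dt.day_name().tolist()    # English day names by default
```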

In [None]:
#### Creating a new dataframe that aggregates averages by day of the week

metrics = ["steps", "calories", "TotalMinutesAsleep", "TotalTimeInBed",
           "VeryActiveMinutes", "FairlyActiveMinutes", "LightlyActiveMinutes", "SedentaryMinutes",
           "very_active_distance", "moderately_active_distance", "lightly_active_distance",
           "SedentaryActiveDistance", "in_bed_not_sleeping"]
# A list of columns (double brackets) is required by recent pandas versions
avg_by_day = df.groupby("day_of_week")[metrics].mean()

# Keep the day name as a column and sort the rows from Monday to Sunday
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_by_day['weekday'] = pd.Categorical(avg_by_day.index, categories=days, ordered=True)
avg_by_day = avg_by_day.sort_values('weekday').reset_index(drop=True)
avg_by_day

## Continuing Phase 4 - Analyse + Phase 5 - Share

### Plotting data

In [None]:
#### Average calories burned by day of the week

x = avg_by_day['weekday'] 
y = avg_by_day['calories'].round(0)

fig = plt.figure(figsize=(10,8))
plt.bar(x, y)

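# Label each bar with its value (positional access works whatever the index is)
for i in range(len(x)):
    plt.text(i, y.iloc[i], f"{y.iloc[i]:.0f}", ha="center", va="bottom")

plt.ylabel('Calories')
plt.title('Average Calories by Day')
plt.show()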

#### The days on which people burn the most calories are Monday, Tuesday and Saturday.

In [None]:
#### Average steps taken by day of the week

x = avg_by_day['weekday'] 
y = avg_by_day['steps'].round(0)

fig = plt.figure(figsize=(10,8))
plt.bar(x, y)

# Label each bar with its value (positional access works whatever the index is)
for i in range(len(x)):
    plt.text(i, y.iloc[i], f"{y.iloc[i]:.0f}", ha="center", va="bottom")

plt.ylabel('Steps')
plt.title('Average Steps by Day')
plt.show()

#### The days on which people take the most steps are Saturday (first), Monday (second) and Tuesday (third).
#### The day on which people take the fewest steps is Sunday. We might assume this is because Sunday is a day of rest for most people.
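
#### Rankings like the one above can also be read programmatically instead of from the chart. The following is a minimal sketch: the dataframe and its step counts are invented for illustration and are not the results of this analysis.

```python
import pandas as pd

# Hypothetical daily averages, invented for illustration only.
example = pd.DataFrame({
    "weekday": ["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"],
    "steps": [9300, 9200, 8000, 8100, 7900, 9900, 7300],
})

# Sort from highest to lowest average and keep the three leading days.
ranked = example.sort_values("steps", ascending=False)
top_three = ranked["weekday"].head(3).tolist()

# idxmin returns the row label of the smallest value.
lowest = example.loc[example["steps"].idxmin(), "weekday"]
```

#### The same two lines could be pointed at avg_by_day and any of its metric columns.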

In [None]:
#### Average very active minutes by day 

x = avg_by_day['weekday'] 
y = avg_by_day['VeryActiveMinutes'].round(0)

fig = plt.figure(figsize=(10,8))
plt.bar(x, y)

# Label each bar with its value (positional access works whatever the index is)
for i in range(len(x)):
    plt.text(i, y.iloc[i], f"{y.iloc[i]:.0f}", ha="center", va="bottom")

plt.ylabel('Very Active Minutes')
plt.title('Average Very Active Minutes by Day')
plt.show()

#### People are most active on Monday and Tuesday, followed by Saturday.

In [None]:
#### Average fairly active minutes by day 

x = avg_by_day['weekday'] 
y = avg_by_day['FairlyActiveMinutes'].round(0)

fig = plt.figure(figsize=(10,8))
plt.bar(x, y)

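# Label each bar with its value (positional access works whatever the index is)
for i in range(len(x)):
    plt.text(i, y.iloc[i], f"{y.iloc[i]:.0f}", ha="center", va="bottom")

plt.ylabel('Fairly Active Minutes')
plt.title('Average Fairly Active Minutes by Day')
plt.show()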

#### The day with the most fairly active minutes is Saturday, followed by Monday and Tuesday. The day with the fewest fairly active minutes is Friday, possibly because many people go out on that day.

In [None]:
#### Average minutes in bed by day 

x = avg_by_day['weekday'] 
y = avg_by_day['TotalTimeInBed'].round(0)

fig = plt.figure(figsize=(10,8))
plt.bar(x, y)

# Label each bar with its value (positional access works whatever the index is)
for i in range(len(x)):
    plt.text(i, y.iloc[i], f"{y.iloc[i]:.0f}", ha="center", va="bottom")

plt.ylabel('Minutes in Bed')
plt.title('Average Minutes in Bed by Day')
plt.show()

#### Sunday is the day on which people spend the most time in bed, as might be expected.
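
#### The five charts above share the same structure, so they could be produced by one helper. The sketch below is an optional refactoring rather than the code used for the figures: the function name plot_daily_average is hypothetical, and the three-row dataframe contains invented values.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_daily_average(data, column, ylabel, title):
    """Draw a labelled bar chart of one averaged metric per weekday."""
    labels = data["weekday"].astype(str).tolist()
    values = data[column].round(0).tolist()

    fig, ax = plt.subplots(figsize=(10, 8))
    bars = ax.bar(labels, values)

    # Write each value just above the centre of its bar.
    for bar, value in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2, value, f"{value:.0f}",
                ha="center", va="bottom")

    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return ax

# Hypothetical values, invented to demonstrate the helper.
example = pd.DataFrame({
    "weekday": ["Monday", "Tuesday", "Wednesday"],
    "TotalTimeInBed": [457.3, 443.1, 470.0],
})
ax = plot_daily_average(example, "TotalTimeInBed",
                        "Minutes in Bed", "Average Minutes in Bed by Day")
```

#### In the notebook, each chart would then be a single call on avg_by_day followed by plt.show().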

### Conclusions

#### Mondays and Tuesdays are the two days of the week on which users log the most very active minutes. It is the start of the week, so they probably feel energised and are keen to take more intense exercise.

#### Saturday is the day on which users take the most steps, but the type of exercise they take is fairly active rather than very active. We can assume people have more free time on Saturday and prefer relaxing walks to intense exercise.

#### Sunday is the day on which people rest the most, with the fewest steps and a long time in bed.

#### Friday is the day on which people seem to go out the most, with low very active and fairly active minutes but also a short time in bed.


## Phase 6 - Act

#### The most appropriate product we can apply these insights to is the Bellabeat app, since it provides users with health data about their habits. The insights identified through this analysis relate directly to the data provided by this type of product.

#### Within the app, users could be provided with personally customised reports outlining their exercise and sleeping patterns, identifying opportunities to improve their health habits.
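
#### As a rough idea of what the data behind such a report might look like, the sketch below aggregates activity and sleep per user. It is only an illustration under stated assumptions: the records are invented, and the output column names (avg_steps and so on) are hypothetical rather than part of any Bellabeat product.

```python
import pandas as pd

# Hypothetical daily records for two users (values invented for illustration).
records = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "steps": [10000, 8000, 4000, 6000],
    "TotalMinutesAsleep": [420, 400, 300, 360],
    "TotalTimeInBed": [450, 430, 360, 400],
})

# One row per user with their average daily activity and sleep.
user_report = records.groupby("Id").agg(
    avg_steps=("steps", "mean"),
    avg_minutes_asleep=("TotalMinutesAsleep", "mean"),
    avg_minutes_in_bed=("TotalTimeInBed", "mean"),
)
```

#### Each row could then be compared with the weekly patterns found above to suggest, for instance, the days on which a user tends to be least active.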