# Fitbit Data Visualisation in Python
(See associated [blog posting](https://vgelinas.github.io/post/fitbit-data-exploration-part-i/)).

In this project we will explore some Fitbit activity data pulled via [orcasgit's python-fitbit api](https://github.com/orcasgit/python-fitbit). We will go through the following steps:
1. Data collection
2. Data cleaning
3. Data visualisation

### Dependencies
* Python 3+
* The [python-fitbit api](https://pypi.org/project/fitbit/)
* The [ratelimit package](https://pypi.org/project/ratelimit/)
* The datetime and json standard libraries, plus the third-party matplotlib and pandas packages

Let's load our packages.

In [None]:
import fitbit
import json
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from ratelimit import limits, sleep_and_retry

%matplotlib inline

## 1. Data Collection
We do this in two steps:
* We first access the API via python-fitbit, dealing with the necessary authentication steps.
* We then sample some responses, and build datasets by querying over a range of dates.

### 1.1. Authentication setup

To collect personal data, we first need to [set up a Fitbit app](https://dev.fitbit.com/apps/new) and collect the client_id and client_secret for this app. For this project I've chosen to keep these in a credentials.json file stored in a dedicated subfolder named 'oauth', but any location works as long as you have them on hand.

In [None]:
!cat oauth/credentials.json

We also need tokens for authentication. We need:

* An access token.
* A refresh token.
* An expiration time for the access token (the refresh token never expires).

These can be obtained by going to the [Manage my apps](https://dev.fitbit.com/apps) section on the Fitbit website, selecting your app and navigating to "OAuth 2.0 tutorial page". Alternatively, you can run the script "gather_keys_oauth2.py" from the python-fitbit [github page](https://github.com/orcasgit/python-fitbit), in which case you should set your Fitbit app's callback URL to https://127.0.0.1:8080/. 

The access token serves to authenticate and typically expires after ~8 hours. The refresh token is then used to obtain a new pair (access_token, refresh_token) from the API. As with the credentials, I chose to store these in a JSON file, named tokens.json.

In [None]:
!cat oauth/tokens.json

The only important keys above are "access_token", "refresh_token" and "expires_at" (the rest corresponds to optional arguments). 
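As a quick sanity check before making any calls, we can work out how long the stored access token has left. This is only a sketch: it assumes "expires_at" is a Unix timestamp in seconds, and the token dictionary below is made up for illustration.

```python
import time


def seconds_until_expiry(tokens, now=None):
    """Return the seconds left before the access token expires (negative if expired)."""
    now = time.time() if now is None else now
    return tokens["expires_at"] - now


# Hypothetical token dictionary; real values would come from oauth/tokens.json
example_tokens = {
    "access_token": "abc",
    "refresh_token": "def",
    "expires_at": 1_588_320_000.0,  # assumed to be a Unix timestamp in seconds
}

# Pretend "now" is one hour before the expiry time
remaining = seconds_until_expiry(example_tokens, now=1_588_320_000.0 - 3600)
print(f"Token expires in {remaining / 3600:.1f} hours")
```

A negative value means the access token has already expired and the refresh token is needed.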

Next up, the code below instantiates a fitbit client which will handle API calls for us. We pass along the credentials and tokens as arguments, and we also pass a "token refresh" function which will store the new (access_token, refresh_token) pair sent by the API whenever the first one expires. 

In [None]:
# Load credentials
with open("./oauth/credentials.json", "r") as f:
    credentials = json.load(f)

# Load tokens
with open("./oauth/tokens.json", "r") as f:  
    tokens = json.load(f)  

client_id = credentials['client_id'] 
client_secret = credentials['client_secret']
access_token = tokens['access_token']
refresh_token = tokens['refresh_token']
expires_at = tokens['expires_at'] 

# Token refresh method 
def refresh_callback(token):   
    """ Called when the OAuth token has been refreshed """ 
    with open("./oauth/tokens.json", "w") as f: 
        json.dump(token, f)  

# Initialise client  
client = fitbit.Fitbit(client_id=client_id, 
                       client_secret=client_secret,
                       access_token=access_token,
                       refresh_token=refresh_token,
                       expires_at=expires_at,
                       refresh_cb=refresh_callback)

The first time this is called you should be served an authorisation page for authentication, but afterwards the refresh token song & dance should handle this in the background, and we won't need to set it up again unless you lose your tokens.

### 1.2. A first look at the response data
The python-fitbit api supports the methods listed [here](https://python-fitbit.readthedocs.io/en/latest/#fitbit-api). For example, we could call:

* **client.sleep**, to get basic sleep data (bed time and wake time, time awake at night, ...).
* **client.activities**, to get timestamps for activities (walking, running, cycling, ...) and summary data (number of steps, minutes active, ...).
* **client.intraday_time_series**, to get granular data on various activities (such as heart rate or step rate for every minute of the day).

We'll be interested in the activities and intraday steps data. Now, let's take a look at the response for one date, say May 1st.

In [None]:
# Get activity data for May 1st
# The API takes a date formatted as 'YYYY-MM-DD'
date = '2020-05-01'
activities_response = client.activities(date=date)

# Display response
activities_response

Let's look at the type of the response object.

In [None]:
type(activities_response)

The response consists of nested dictionaries. We'll extract two datasets from the 'activities' and 'summary' keys.

In [None]:
# Get activities dataset
activities = activities_response['activities']
activities = pd.DataFrame(activities)
activities

In [None]:
# Get summary dataset
summary = activities_response['summary']

# Remove sub-dictionaries
del summary['distances']
del summary['heartRateZones']

summary = pd.DataFrame(summary, index=[0])  # all values are scalars, must pass an index
summary
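Note that `del` modifies the response dictionary in place, so re-running the cell raises a KeyError. One alternative is to build a filtered copy that keeps only the scalar values. The sketch below uses a made-up fragment of a summary, and the helper name `scalar_items` is hypothetical.

```python
def scalar_items(d):
    """Return a copy of d without its nested list or dict values."""
    return {k: v for k, v in d.items() if not isinstance(v, (list, dict))}


# Made-up fragment of a summary dictionary, for illustration only
example_summary = {
    "steps": 10234,
    "caloriesOut": 2450,
    "distances": [{"activity": "total", "distance": 7.8}],
    "heartRateZones": [{"name": "Fat Burn", "minutes": 42}],
}

flat = scalar_items(example_summary)
print(flat)
```

The original dictionary is left untouched, so the cell can be re-run safely.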

Next, let's look at the intraday step data.

In [None]:
# Get intraday steps data
steps_response = client.intraday_time_series('activities/steps', base_date=date, detail_level="1min")

# Extract dataset from response object
steps = steps_response['activities-steps-intraday']['dataset']

# Display dataset
steps = pd.DataFrame(steps)
steps

We get the minute-by-minute count of steps on that day. Let's take a quick look at a plot.

In [None]:
steps.plot()
plt.show()

### 1.3. Collect activity and intraday steps data since October 1st.

We can now build our datasets, which will consist of general activity data and intraday steps data from October 1st to yesterday. We will:

* Produce a list of dates in 'YYYY-MM-DD' string format for our queries.
* Query the API for each date, extracting our 'activities', 'summary' and 'steps' datasets from the response.
* Limit our query rate to 150/hour (since this is the Fitbit API rate limit).
* Combine and store the results.

First, let's get a list of dates. We can use the pandas **date_range** method to produce a list of datetime objects, and format them using the **strftime** method.

In [None]:
# Get date range from October 1st to yesterday
start = pd.to_datetime("2019-10-01")
date_range = pd.date_range(start=start, end=datetime.today() - timedelta(days=1))
date_range = [datetime.strftime(date, "%Y-%m-%d") for date in date_range]
date_range[-5:]
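The formatting step can also be done without a list comprehension, since a pandas DatetimeIndex has its own **strftime** method. A small sketch over a fixed five-day range:

```python
import pandas as pd

# Five consecutive days, formatted as 'YYYY-MM-DD' strings in one call
dates = pd.date_range(start="2019-10-01", end="2019-10-05")
date_strings = dates.strftime("%Y-%m-%d").tolist()
print(date_strings)
```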

Next, we query the API for each date in date_range. 

As seen when we first took a look at the response data, we actually make two API calls per date (i.e. client.activities and client.intraday_time_series). Since the Fitbit API has a rate limit of 150 calls/hour, we should query at most 75 dates an hour. We can accomplish this via the [ratelimit](https://pypi.org/project/ratelimit/) package, which lets you limit the number of times a function is called over a time period.
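To get a feel for how long the collection takes, here is a rough estimate. It assumes the limiter lets a first batch of dates through immediately and then waits out a full hour before each further batch, which is how I understand the ratelimit decorators to behave. The figure of 70 dates per hour is the value used in the collection script, and the end date matches the files used later in this notebook.

```python
import math
from datetime import date


def estimated_wait_hours(n_dates, dates_per_hour=70):
    """Rough number of hours spent waiting on the rate limiter."""
    batches = math.ceil(n_dates / dates_per_hour)
    return max(batches - 1, 0)  # the first batch goes through without waiting


# Number of days from October 1st, 2019 to May 18th, 2020, both included
n_dates = (date(2020, 5, 18) - date(2019, 10, 1)).days + 1
print(n_dates, "dates, roughly", estimated_wait_hours(n_dates), "hours of waiting")
```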

Finally, we call the API for each day, timestamp the resulting datasets, and store the total in csv files locally.
We do this for each of the 'activities', 'summary' and 'steps' datasets. The script below accomplishes this.

In [None]:
# We define a data collection function, and we use the ratelimit package
# to limit it to 70 calls / hour. Each call makes two API requests, so this
# stays under the Fitbit limit of 150 requests / hour.
ONE_HOUR = 3600

@sleep_and_retry
@limits(calls=70, period=ONE_HOUR)
def call_fitbit_api(date):
    """ Call the Fitbit API for given date in format 'YYYY-MM-DD',
        Return tuple (activities, summary, steps) of dataframes """
    
    # Call API twice to get activities and steps responses
    activities_data = client.activities(date=date)
    steps_data = client.intraday_time_series('activities/steps', base_date=date, detail_level='1min')
        
    # Get activities dataset
    activities = activities_data['activities']
    activities = pd.DataFrame(activities)
    
    # Get summary dataset
    summary = activities_data['summary']
    del summary['distances']
    del summary['heartRateZones']
    summary = pd.DataFrame(summary, index=[0])
        
    # Get steps intraday dataset  
    steps = steps_data['activities-steps-intraday']['dataset']
    steps = pd.DataFrame(steps)
    
    # Add a date column (a scalar is broadcast to every row)
    activities['date'] = date
    summary['date'] = date
    steps['date'] = date
    
    return activities, summary, steps


def get_fitbit_data(date_range):
    """ Collect 'activities', 'summary' and 'steps' datasets over given dates
        Store as CSV files with format RESOURCE_DATE_to_DATE.csv """
    
    daily_df = {
        'activities': [],
        'summary': [],
        'steps': []
    }

    for date in date_range:
        # Call API and get three datasets
        activities, summary, steps = call_fitbit_api(date)
    
        # Append to previous datasets
        daily_df['activities'].append(activities)
        daily_df['summary'].append(summary)
        daily_df['steps'].append(steps)
        
    # Store total dataset as file with format "resource_DATE_to_DATE.csv"
    start, end = date_range[0], date_range[-1]

    for resource in daily_df:
        df = pd.concat(daily_df[resource], ignore_index=True)
        df.to_csv("./data/raw/{}_{}_to_{}.csv".format(resource, start, end), index=False)

In [None]:
# Collect Fitbit 'activities', 'summary' and 'steps' data since October 1st, 2019
get_fitbit_data(date_range=date_range)

## 2. Cleaning the data

It's time to take a look at each dataset.

### 2.1. The activity dataset

In [None]:
activities = pd.read_csv("./data/raw/activities_2019-10-01_to_2020-05-18.csv")
activities.head(3)

In [None]:
activities.shape

We have 16 columns, many of which contain logging information, True/False data or duplicate information which is not useful to us. Let's drop these.

In [None]:
drop_columns = ['activityId', 'activityParentId', 'activityParentName', 'hasStartTime', 
                'isFavorite', 'lastModified', 'logId', 'startDate']

activities.drop(drop_columns, axis=1, inplace=True)

Next, let's look at the distance column. Consulting the documentation, we see that this means logged distance. Since I've rarely used the feature, it looks like the column consists mostly of missing values.

In [None]:
activities.distance.value_counts()

Since we only have 2 non-missing values in 354 rows, let's drop the column.

In [None]:
activities.drop('distance', axis=1, inplace=True)

Some of the column names are in camelCase. Let's rename them to Python's favored snake_case.

In [None]:
activities.rename(columns={'startTime': 'start_time'}, inplace=True)
activities.head(3)

The duration column isn't easy to parse and is missing units. The Fitbit API [documentation](https://dev.fitbit.com/build/reference/web-api/activity/#activity-logging) lists the duration as being in milliseconds, so let's put it in minutes and rename accordingly.

In [None]:
activities.duration = activities.duration.apply(lambda x: round(x/60000))
activities.rename(columns={'duration': 'duration_min'}, inplace=True)

activities.head(3)
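As a worked example of the conversion: one minute is 60,000 milliseconds, so a 30-minute walk is logged as 1,800,000. One detail worth knowing is that Python's built-in `round` sends exact halves to the nearest even integer. The helper name below is hypothetical.

```python
MS_PER_MINUTE = 60_000


def ms_to_minutes(duration_ms):
    """Convert a duration in milliseconds to whole minutes."""
    return round(duration_ms / MS_PER_MINUTE)


print(ms_to_minutes(1_800_000))  # exactly 30 minutes
print(ms_to_minutes(1_830_000))  # 30.5 minutes, rounds to the even neighbour 30
print(ms_to_minutes(1_890_000))  # 31.5 minutes, rounds to the even neighbour 32
```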

To help with analysis, let's format the start_time column as "YYYY-MM-DD H:M:S" to more easily convert to a datetime object. Since we have the activity duration, we can also add an end_time column.

In [None]:
# Format start_time column and convert to datetime object
activities.start_time = activities.date + " " + activities.start_time + ":00"
activities.start_time = pd.to_datetime(activities.start_time)

# Create end_time column by adding the duration_min column to start_time
activities_duration = activities.duration_min.apply(lambda x: timedelta(minutes=x))
activities['end_time'] = activities.start_time + activities_duration

# Display result
activities.head(3)
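The end time can also be computed without `apply`, by converting the whole duration column at once with `pd.to_timedelta`. A sketch on two made-up rows, the second of which runs past midnight:

```python
import pandas as pd

# Made-up activity rows for illustration
example = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-05-01 08:15:00", "2020-05-01 23:50:00"]),
    "duration_min": [30, 25],
})

# Convert the whole column of minutes to timedeltas in one call
example["end_time"] = example.start_time + pd.to_timedelta(example.duration_min, unit="min")
print(example)
```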

Finally, let's reorder the columns for readability.

In [None]:
# Reorder columns
column_order = ['date', 'name', 'description', 'start_time', 'end_time', 'duration_min', 'steps', 'calories']
activities = activities[column_order]

# Store dataset
start, end = date_range[0], date_range[-1]
activities.to_csv("./data/tidy/activities_{}_to_{}.csv".format(start, end), index=False)

# Look at end result
activities

### 2.2. The summary dataset

Now let's take a look at the second dataset.

In [None]:
summary = pd.read_csv("./data/raw/summary_2019-10-01_to_2020-05-18.csv")
summary

Now, the activeScore column is added by the python-fitbit wrapper to the Fitbit API. All values are -1 in our dataset so there's not much loss of information in dropping the column.

In [None]:
(summary.activeScore == -1).all()

In [None]:
summary.drop('activeScore', axis=1, inplace=True)
summary.head(2)

Next, we again format all columns to snake_case and reorder for readability.

In [None]:
# Rename columns to snake_case
columns_map = {
    'activityCalories': 'activity_calories',
    'caloriesBMR': 'calories_BMR',
    'caloriesOut': 'calories_out',
    'fairlyActiveMinutes': 'fairly_active_minutes',
    'lightlyActiveMinutes': 'lightly_active_minutes',
    'marginalCalories': 'marginal_calories',
    'restingHeartRate': 'resting_heart_rate',
    'sedentaryMinutes': 'sedentary_minutes',
    'veryActiveMinutes': 'very_active_minutes'
}

summary.rename(columns=columns_map, inplace=True)

# Reorder columns
column_order = ['date', 'steps', 'very_active_minutes', 'fairly_active_minutes', 'lightly_active_minutes', 
                'sedentary_minutes', 'activity_calories', 'marginal_calories', 'calories_out', 'calories_BMR',
                'resting_heart_rate']

summary = summary[column_order]

# Store dataset
start, end = summary.date.iloc[0], summary.date.iloc[-1]
summary.to_csv("./data/tidy/summary_{}_to_{}.csv".format(start, end), index=False)

# Look at result
summary.head(3)

### 2.3. The steps dataset

Finally, we look at the intraday steps dataset.

In [None]:
steps = pd.read_csv("./data/raw/steps_2019-10-01_to_2020-05-18.csv")
steps

We can combine the time and date into a single column, in datetime format. We also rename value to the more descriptive 'stepcount'.

In [None]:
# Combine date and time
steps.time = steps.date + " " + steps.time

# Rename value to stepcount
steps.rename(columns={'value': 'stepcount'}, inplace=True)

# Get endpoint dates to store the file
start, end = steps.date.iloc[0], steps.date.iloc[-1]

# Drop date column and store
steps.drop('date', axis=1, inplace=True)
steps.to_csv("./data/tidy/steps_{}_to_{}.csv".format(start, end), index=False)

# Look at end result
steps

## 3. Visualisations

### 3.1. Activity statistics per week day

Let's compile some statistics based on day of the week. First, let's take a look at summary data.

In [None]:
# Use parse_dates to interpret our date column as datetime objects
summary = pd.read_csv("./data/tidy/summary_2019-10-01_to_2020-05-18.csv", parse_dates=['date'])
summary

We can use strftime to convert the date to a week day, and get group statistics per day of the week.


In [None]:
# Add a weekday column
summary['weekday'] = summary.date.apply(lambda x: datetime.strftime(x, "%A"))

# Get statistics per day of the week
weekly_statistics = summary.groupby('weekday').describe()

# Row indices are days of the week, put them in order
row_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekly_statistics = weekly_statistics.loc[row_order, :]

# Show results
weekly_statistics
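An equivalent way to get the weekday names is the `dt.day_name()` accessor, and `reindex` puts the rows in calendar order even when some days are missing from the data. A sketch on a few made-up rows:

```python
import pandas as pd

# Made-up daily step counts for illustration
example = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-08", "2020-05-04"]),
    "steps": [12000, 8000, 10000, 9000],
})

# Weekday name for each date
example["weekday"] = example.date.dt.day_name()

# Mean steps per weekday, in calendar order (days without data become NaN)
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
mean_steps = example.groupby("weekday")["steps"].mean().reindex(order)
print(mean_steps)
```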

In [None]:
# Plot the mean and first quartile for number of steps per weekday
mean_steps = weekly_statistics.steps[['mean', '25%']]
mean_steps.plot(kind='bar')

plt.title('Weekly stepcount since October 1st, 2019')
plt.ylabel('steps')
plt.ylim([0, 18000])
plt.show()

### 3.2. Visualising walks over the day
Let's now look at the steps intraday data.

In [None]:
# Load in dataset
steps = pd.read_csv("./data/tidy/steps_2019-10-01_to_2020-05-18.csv", parse_dates=['time'])
steps

Let's visualise steps intraday data over a given day. We look at May 1st again.

In [None]:
date = '2020-05-01'

# Restrict to logs for given date
day_df = steps[steps.time.apply(lambda x: datetime.strftime(x, "%Y-%m-%d")) == date].copy()

# Restrict to within waking hours
start_of_day = pd.to_datetime('2020-05-01 07:00:00')
end_of_day = pd.to_datetime('2020-05-01 23:00:00')
day_df = day_df[(day_df.time >= start_of_day)&(day_df.time <= end_of_day)]

# Convert time back to hr:min:sec format and set as index
day_df.time = day_df.time.apply(lambda x: datetime.strftime(x, "%H:%M:%S"))
day_df.set_index('time', inplace=True)

Now let's plot steps during the day on May 1st.

In [None]:
# Plot steps on May 1st
fig, ax = plt.subplots()

day_df.rolling(15).mean().plot(ax=ax)  # 15 min rolling avg to smooth out noise
ax.set_title('Steps on May 1st, 2020')
ax.set_xlabel('Time of Day')
ax.set_ylabel('Steps per min')
plt.show()

Here we can tell which period corresponds to exercise, and which results from general activity, but let's be more systematic about this. We can isolate the steps that result from walks alone and not from general activity. The activity dataset has a start_time and end_time for each activity (walk, run, ...) and we may use these to filter our dataset.

In [None]:
# Load activities dataset, parsing start_time and end_time columns as datetime objects
time_col = ['start_time', 'end_time']
activities = pd.read_csv("./data/tidy/activities_2019-10-01_to_2020-05-18.csv", parse_dates=time_col)
activities.head(3)

Let's add a column named 'on_walk' to the steps dataset, with a True/False value. For this we cook up a helper function as below:

In [None]:
# Helper function to filter the intraday steps data by activity type
def is_during_activity(t, activity):
    """ Takes a datetime object t and activity name
        Returns True if during activity, else False """
    # Get the activities dataset for that day
    date = datetime.strftime(t, "%Y-%m-%d")
    df = activities[activities.date == date]
    
    # Subset to rows which represent activity
    df = df[df.name == activity]
    
    # Check if t is within the bounds of the activity
    for i in df.index:
        if df.loc[i, 'start_time'] <= t <= df.loc[i, 'end_time']:
            return True
    
    return False


# Add 'on_walk' column to steps dataframe
steps['on_walk'] = steps.time.apply(is_during_activity, args=('Walk',))
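Calling a Python function once per row can be slow on a dataset with one row per minute. One alternative is to loop over the (much shorter) list of walks and mark every timestamp that falls inside each one. This is a sketch on made-up data, not a drop-in replacement for the cell above.

```python
import pandas as pd

# Made-up intraday data: six consecutive minutes
steps_example = pd.DataFrame({
    "time": pd.date_range("2020-05-01 08:00", periods=6, freq="min"),
    "stepcount": [0, 90, 100, 95, 0, 10],
})

# Made-up activity log with a single walk from 08:01 to 08:03
walks_example = pd.DataFrame({
    "name": ["Walk"],
    "start_time": pd.to_datetime(["2020-05-01 08:01"]),
    "end_time": pd.to_datetime(["2020-05-01 08:03"]),
})

# Start with all False, then switch on the minutes covered by each walk
on_walk = pd.Series(False, index=steps_example.index)
for row in walks_example.itertuples():
    # between() includes both endpoints, like start_time <= t <= end_time
    on_walk |= steps_example.time.between(row.start_time, row.end_time)

steps_example["on_walk"] = on_walk
print(steps_example)
```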

Let's take a look at the stepcount during walks.

In [None]:
steps[steps.on_walk == True]

Using this, we can create a new dataframe consisting of walks stepcount data.

In [None]:
# Set all steps outside of walks to zero
walks = steps.copy()
walks.stepcount = walks.stepcount.where(walks.on_walk == True, 0)
    
# Drop 'on_walk' column
walks.drop('on_walk', axis=1, inplace=True)

Let's look at May 1st again.

In [None]:
date = '2020-05-01'

# Restrict to logs for given date
day_walks = walks[walks.time.apply(lambda x: datetime.strftime(x, "%Y-%m-%d")) == date].copy()

# Restrict to within waking hours
start_of_day = pd.to_datetime('2020-05-01 07:00:00')
end_of_day = pd.to_datetime('2020-05-01 23:00:00')
day_walks = day_walks[(day_walks.time >= start_of_day)&(day_walks.time <= end_of_day)]

# Convert time back to hr:min:sec format and set as index
day_walks.time = day_walks.time.apply(lambda x: datetime.strftime(x, "%H:%M:%S"))
day_walks.set_index('time', inplace=True)

In [None]:
# Plot walks on May 1st
fig, ax = plt.subplots()

day_walks.rolling(15).mean().plot(ax=ax)  # 15 min rolling avg to smooth out noise
ax.set_title('Steps on May 1st 2020 during a walk')
ax.set_xlabel('Time of Day')
ax.set_ylabel('Steps per min')

plt.show()
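Both plots above use a rolling average to smooth out noise. To see what `rolling(n).mean()` does, here is a tiny example with a window of 3 instead of 15: each value is the mean of the current entry and the two before it, and the first two entries have no full window, so they come out as NaN.

```python
import pandas as pd

# Made-up per-minute step counts
counts = pd.Series([0, 0, 30, 60, 90, 0])

# Rolling mean over a window of 3 entries
smoothed = counts.rolling(3).mean()
print(smoothed)
```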

#### Visualise walk times for each day of the week.

We can build a picture of the 'average' day since October 1st, broken down by day of the week.

In [None]:
# Add a weekday column to walks dataset for grouping
walks['weekday'] = walks.time.apply(lambda x: datetime.strftime(x, "%A"))
walks

To build our daily picture, let's first group the dataset by day of the week, then average the stepcount for each given minute. This should give us a sense of the distribution of walks on each day.

In [None]:
# Change time column to hour:min strings for grouping
walks.time = walks.time.apply(lambda x: datetime.strftime(x, "%H:%M"))

# for each day of the week, average step count over all dates
walks_weekday = walks.groupby('weekday') 

weekdays = {}
for day_name, df in walks_weekday:
    # group by minute, then average over dates
    df = df.groupby('time').mean(numeric_only=True)  # skip the non-numeric weekday column
    weekdays[day_name] = df
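The same averages can be computed in one pass by grouping on both keys at once and then pivoting the weekdays into columns. This is a sketch on a handful of made-up rows; it gives one table with a column per weekday rather than a dictionary of dataframes.

```python
import pandas as pd

# Made-up walk data: two Mondays at 08:00, one Monday at 08:01, one Tuesday
example = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Monday", "Tuesday"],
    "time": ["08:00", "08:00", "08:01", "08:00"],
    "stepcount": [100, 50, 20, 80],
})

# Average per (weekday, minute), then move weekdays into columns
profile = example.groupby(["weekday", "time"])["stepcount"].mean().unstack("weekday")
print(profile)
```

Minutes that never occur for a given weekday show up as NaN in that column.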

We can also get rid of the timestamps during the night, since I'm not up for midnight walks too often.

In [None]:
# Restrict to waking hours, say 07:00 to 23:59
# (row 420 is minute 7 * 60, assuming one row per minute of the day)
for day in weekdays:
    weekdays[day] = weekdays[day].iloc[420:]
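The offset 420 is simply the number of minutes between midnight and 07:00. A small helper (hypothetical name) makes the arithmetic explicit, assuming the data has exactly one row per minute starting at 00:00:

```python
def minute_of_day(hhmm):
    """Return the number of minutes since midnight for an 'HH:MM' string."""
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 60 + minutes


print(minute_of_day("07:00"))  # row offset of the first waking minute
print(minute_of_day("23:59"))  # row offset of the last minute of the day
```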

Now, let's look at the distribution of walks on Mondays.

In [None]:
weekdays['Monday'].rolling(15).mean().plot()
plt.show()

Finally, we do this for each day of the week separately.

In [None]:
# Plot each day of the week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

fig1, axes1 = plt.subplots(1, 5, figsize=(25, 5))
fig2, axes2 = plt.subplots(1, 2, figsize=(25, 5))

# Plot Monday-Friday first
for i in range(5):
    # Take 15min rolling average
    df = weekdays[days[i]].rolling(15).mean()
    
    # Relabel
    df.rename(columns={'stepcount': 'steps/min'}, inplace=True)
    
    # Plot day
    df.plot(ax=axes1[i])
    axes1[i].set_title(days[i])
    axes1[i].set_xlabel("Time of Day")
    
# Then plot Saturday-Sunday
for i in range(2):
    # Take 15 min rolling average
    df = weekdays[days[5+i]].rolling(15).mean()
    
    # Relabel
    df.rename(columns={'stepcount': 'steps/min'}, inplace=True)

    # Plot day
    df.plot(ax=axes2[i])
    axes2[i].set_title(days[5+i])
    axes2[i].set_xlabel("Time of Day")