# Project Goal: Analyzing Personal Cycling Activity from Google Location History

The goal of this project is to extract, clean, and analyze my **cycling activity data** from Google Timeline Location History, as exported through Google Takeout.

Using structured filtering based on **activity type**, **confidence level**, and **timestamp**, the project aims to:

- Identify **reliable cycling trips** (medium/high confidence)
- Focus on activity between **February and May 2023**
- Convert and enrich data for **distance**, **calendar**, and **weekday analysis**
- Understand patterns in **daily and weekly cycling behavior**
- Enable clear **visualization and insights** from personal mobility data

This analysis helps quantify real-world cycling habits and supports personal fitness tracking using raw location data.

## Step 1: Import Required Libraries

We begin by importing the necessary Python libraries for:

- Working with ZIP archives
- Loading and manipulating structured data
- Parsing date and time information

In [169]:
from zipfile import ZipFile
import pandas as pd
import json
from datetime import datetime

## Step 2: Set File Path for Raw Data

Set the working directory to the location where your extracted Google Takeout files are stored.

In [172]:
%cd "C:/Users/rugge/Dropbox/Personal Portfolio/Fitness/Cycling Data/Raw data"

C:\Users\rugge\Dropbox\Personal Portfolio\Fitness\Cycling Data\Raw data


The next cell processes the Google Location History Takeout ZIP file and extracts every activity segment it contains. It performs the following steps:

1. **Open** the ZIP file containing the Google Takeout data.
2. **Search** for files under the "Semantic Location History" folder.
3. **Parse** the JSON contents to find `activitySegment` entries.
4. **Keep** only those segments that:
   - Include a `distance` field
   - Include a `confidence` level
5. **Extract** key attributes from each remaining segment:
   - Activity type and its probability
   - Distance traveled
   - Confidence level
   - Start and end timestamps
6. **Append** each valid activity to a list called `cycle_trips`.

The final result is a list called `cycle_trips` that contains structured dictionaries, one per activity segment. At this stage the list still includes non-cycling activities such as walking or driving; the cycling-only filter is applied later, once the data is in a DataFrame.

In [175]:
# Credit: the extraction step below borrows heavily from Maksym Kozlenko's walkthrough on
# importing and loading Google Timeline Location History data, adapted here for my use case.
# https://betterprogramming.pub/loading-location-history-places-from-google-timeline-into-pandas-and-csv-c26cb0ac5e89


# Path to the Google Location History Takeout file
history_data_file = "takeout-20230915T180410Z-001.zip"

# Collect one dictionary per activity segment in this list
cycle_trips = []

# Open the ZIP archive and read the Google Timeline location history JSON files inside it
with ZipFile(history_data_file) as myzip:
    for file in myzip.infolist():
        filename = file.filename

        # Only the JSON files under "Semantic Location History" hold the distance data we want
        if "Semantic Location History" not in filename or not filename.endswith(".json"):
            continue

        with myzip.open(filename) as f:
            history_json = json.load(f)

        # Each file stores its records under the "timelineObjects" key
        for timeline_object in history_json["timelineObjects"]:

            # Only "activitySegment" entries describe a movement with a distance
            if "activitySegment" not in timeline_object:
                continue
            cycle_trips_json = timeline_object["activitySegment"]

            # Skip records that lack a distance or a confidence field
            # (confidence is Google's 'LOW', 'MEDIUM', 'HIGH' rating of the recorded activity)
            if "distance" not in cycle_trips_json or "confidence" not in cycle_trips_json:
                continue

            activity_data = {
                "Activity_Type": cycle_trips_json["activities"][0]["activityType"],
                "Activity_Type_probability": cycle_trips_json["activities"][0]["probability"],
                "distance": cycle_trips_json["distance"],
                "confidence": cycle_trips_json["confidence"],
                "startTimestamp": cycle_trips_json["duration"]["startTimestamp"],
                "endTimestamp": cycle_trips_json["duration"]["endTimestamp"],
            }
            cycle_trips.append(activity_data)
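
The extraction cell above indexes straight into `activities[0]` and `duration`, which raises an error if a segment lacks either key. A more defensive variant might look like the sketch below. `parse_activity_segment` is a hypothetical helper, not part of the original notebook, and the sample dictionary is made up for illustration.

```python
def parse_activity_segment(segment):
    """Flatten one activitySegment dict, or return None if key fields are missing."""
    # Hypothetical helper: mirrors the fields collected in the cell above.
    if "distance" not in segment or "confidence" not in segment:
        return None

    # Fall back to an empty dict when "activities" is missing or empty
    activities = segment.get("activities") or [{}]
    top_activity = activities[0]  # the cell above also reads the first entry
    duration = segment.get("duration", {})

    return {
        "Activity_Type": top_activity.get("activityType"),
        "Activity_Type_probability": top_activity.get("probability"),
        "distance": segment["distance"],
        "confidence": segment["confidence"],
        "startTimestamp": duration.get("startTimestamp"),
        "endTimestamp": duration.get("endTimestamp"),
    }


# Made-up sample segment, shaped like the fields used in this notebook
sample_segment = {
    "distance": 1131,
    "confidence": "MEDIUM",
    "activities": [{"activityType": "CYCLING", "probability": 62.95}],
    "duration": {
        "startTimestamp": "2020-05-18T18:00:09.211Z",
        "endTimestamp": "2020-05-18T18:16:18.919Z",
    },
}

parsed_segment = parse_activity_segment(sample_segment)
print(parsed_segment)
```

Missing optional fields come back as `None`, which pandas will treat as missing values once the list is turned into a DataFrame.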

Finally, this list (`cycle_trips`) is converted into a Pandas DataFrame named `cycle_trips_df`, and a working copy is saved as `df` for further analysis.


In [177]:
cycle_trips_df = pd.DataFrame(cycle_trips)

In [178]:
df = cycle_trips_df.copy()

In [179]:
df

Unnamed: 0,Activity_Type,Activity_Type_probability,distance,confidence,startTimestamp,endTimestamp
0,CYCLING,62.95,1131,MEDIUM,2020-05-18T18:00:09.211Z,2020-05-18T18:16:18.919Z
1,CYCLING,82.46,944,HIGH,2020-05-18T19:34:05.079Z,2020-05-18T19:44:14.080Z
2,CYCLING,82.52,1261,HIGH,2020-05-19T09:20:20.388Z,2020-05-19T10:07:08.805Z
3,CYCLING,88.82,1133,HIGH,2020-05-19T13:28:45.485Z,2020-05-19T13:49:58.977Z
4,CYCLING,69.85,2124,MEDIUM,2020-05-20T07:25:32.727Z,2020-05-20T08:14:30.274Z
...,...,...,...,...,...,...
10646,WALKING,0.00,1101,LOW,2015-01-17T22:46:19.429Z,2015-01-19T10:37:09.002Z
10647,UNKNOWN_ACTIVITY_TYPE,0.00,1828,UNKNOWN_CONFIDENCE,2015-01-20T16:53:06.911Z,2015-01-21T20:26:28.553Z
10648,CYCLING,0.00,613,LOW,2015-01-21T20:26:28.553Z,2015-01-21T20:28:28.625Z
10649,IN_VEHICLE,0.00,3719,LOW,2015-01-27T19:17:14.586Z,2015-01-27T21:04:03.619Z


In [180]:
# Parse the ISO 8601 timestamps, drop the time zone info (values stay in UTC),
# and truncate to whole seconds (floor rounds down; it does not round to the nearest second)
df['startTimestamp'] = pd.to_datetime(df['startTimestamp'], format='ISO8601').dt.tz_localize(None).dt.floor('s')
df['endTimestamp'] = pd.to_datetime(df['endTimestamp'], format='ISO8601').dt.tz_localize(None).dt.floor('s')

df

Unnamed: 0,Activity_Type,Activity_Type_probability,distance,confidence,startTimestamp,endTimestamp
0,CYCLING,62.95,1131,MEDIUM,2020-05-18 18:00:09,2020-05-18 18:16:18
1,CYCLING,82.46,944,HIGH,2020-05-18 19:34:05,2020-05-18 19:44:14
2,CYCLING,82.52,1261,HIGH,2020-05-19 09:20:20,2020-05-19 10:07:08
3,CYCLING,88.82,1133,HIGH,2020-05-19 13:28:45,2020-05-19 13:49:58
4,CYCLING,69.85,2124,MEDIUM,2020-05-20 07:25:32,2020-05-20 08:14:30
...,...,...,...,...,...,...
10646,WALKING,0.00,1101,LOW,2015-01-17 22:46:19,2015-01-19 10:37:09
10647,UNKNOWN_ACTIVITY_TYPE,0.00,1828,UNKNOWN_CONFIDENCE,2015-01-20 16:53:06,2015-01-21 20:26:28
10648,CYCLING,0.00,613,LOW,2015-01-21 20:26:28,2015-01-21 20:28:28
10649,IN_VEHICLE,0.00,3719,LOW,2015-01-27 19:17:14,2015-01-27 21:04:03
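
The difference between truncating and rounding is visible in the tables above: `18:16:18.919` becomes `18:16:18`, not `18:16:19`. The short sketch below shows this on two of the timestamps from the output. It uses `utc=True` rather than `format='ISO8601'` only so that it also runs on older pandas versions; that substitution is an assumption of this example, not a change to the notebook.

```python
import pandas as pd

# Two raw timestamps taken from the output above
raw = pd.Series(["2020-05-18T18:16:18.919Z", "2020-05-19T10:07:08.805Z"])

# Parse as UTC, then drop the time zone so the values become naive datetimes
parsed = pd.to_datetime(raw, utc=True).dt.tz_localize(None)

floored = parsed.dt.floor("s")   # truncates the fractional second
rounded = parsed.dt.round("s")   # rounds to the nearest second

print(floored.tolist())
print(rounded.tolist())
```

For trip-level analysis the one-second difference is negligible, so either choice works as long as the comment matches the behavior.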


## Extract Date Components from Timestamps

We extract several useful components from the `endTimestamp` column to support temporal filtering and grouping later in the analysis:

- `day`: Numerical day of the month (e.g., 18)
- `month`: Month number (e.g., 5 for May)
- `year`: Full year (e.g., 2020)
- `date`: Date only (no time)
- `day_name`: Name of the weekday (e.g., Monday, Tuesday)

These new columns will make it easier to filter by time range, summarize by calendar units, and create weekday-based insights.

In [182]:
df['day'] = df['endTimestamp'].dt.day
df['month'] = df['endTimestamp'].dt.month
df['year'] = df['endTimestamp'].dt.year
df['date'] = df['endTimestamp'].dt.date
df['day_name'] = df['endTimestamp'].dt.day_name()

In [183]:
df

Unnamed: 0,Activity_Type,Activity_Type_probability,distance,confidence,startTimestamp,endTimestamp,day,month,year,date,day_name
0,CYCLING,62.95,1131,MEDIUM,2020-05-18 18:00:09,2020-05-18 18:16:18,18,5,2020,2020-05-18,Monday
1,CYCLING,82.46,944,HIGH,2020-05-18 19:34:05,2020-05-18 19:44:14,18,5,2020,2020-05-18,Monday
2,CYCLING,82.52,1261,HIGH,2020-05-19 09:20:20,2020-05-19 10:07:08,19,5,2020,2020-05-19,Tuesday
3,CYCLING,88.82,1133,HIGH,2020-05-19 13:28:45,2020-05-19 13:49:58,19,5,2020,2020-05-19,Tuesday
4,CYCLING,69.85,2124,MEDIUM,2020-05-20 07:25:32,2020-05-20 08:14:30,20,5,2020,2020-05-20,Wednesday
...,...,...,...,...,...,...,...,...,...,...,...
10646,WALKING,0.00,1101,LOW,2015-01-17 22:46:19,2015-01-19 10:37:09,19,1,2015,2015-01-19,Monday
10647,UNKNOWN_ACTIVITY_TYPE,0.00,1828,UNKNOWN_CONFIDENCE,2015-01-20 16:53:06,2015-01-21 20:26:28,21,1,2015,2015-01-21,Wednesday
10648,CYCLING,0.00,613,LOW,2015-01-21 20:26:28,2015-01-21 20:28:28,21,1,2015,2015-01-21,Wednesday
10649,IN_VEHICLE,0.00,3719,LOW,2015-01-27 19:17:14,2015-01-27 21:04:03,27,1,2015,2015-01-27,Tuesday


## Summary: Filtering and Preparing Cycling Data for Analysis

After extracting the activity data and converting it into a Pandas DataFrame (`df`), this section prepares the data for analysis by:

1. **Filtering** only cycling activities (`Activity_Type == 'CYCLING'`).
2. **Excluding** entries with low or unknown confidence by keeping only those with `"MEDIUM"` or `"HIGH"` confidence levels.
3. **Selecting** trips that occurred in the year **2023**, specifically during the months **February to May**.
4. **Resetting** the DataFrame index for a clean view.

The resulting dataset contains medium- and high-confidence cycling activities from February to May 2023. Distances are still in meters at this point; the next section converts them to kilometers.


In [185]:
onlycycling = df['Activity_Type'] == 'CYCLING'
not_non_or_low_confidence = df['confidence'].isin(['MEDIUM','HIGH'])
year_2023 = df['year'] == 2023
feb_to_may = df['month'].isin([2,3,4,5])

In [186]:
df = df.loc[onlycycling & not_non_or_low_confidence & year_2023 & feb_to_may].reset_index(drop=True)

In [187]:
df

Unnamed: 0,Activity_Type,Activity_Type_probability,distance,confidence,startTimestamp,endTimestamp,day,month,year,date,day_name
0,CYCLING,87.31,2418,HIGH,2023-05-01 06:42:25,2023-05-01 07:04:02,1,5,2023,2023-05-01,Monday
1,CYCLING,97.04,946,HIGH,2023-05-01 13:41:49,2023-05-01 13:51:20,1,5,2023,2023-05-01,Monday
2,CYCLING,99.27,5599,HIGH,2023-05-01 13:59:43,2023-05-01 14:30:28,1,5,2023,2023-05-01,Monday
3,CYCLING,99.05,5400,HIGH,2023-05-01 15:40:18,2023-05-01 16:15:45,1,5,2023,2023-05-01,Monday
4,CYCLING,96.38,1015,HIGH,2023-05-01 16:32:35,2023-05-01 16:41:59,1,5,2023,2023-05-01,Monday
...,...,...,...,...,...,...,...,...,...,...,...
427,CYCLING,97.60,1119,HIGH,2023-03-30 18:11:15,2023-03-30 18:24:21,30,3,2023,2023-03-30,Thursday
428,CYCLING,96.41,1140,HIGH,2023-03-30 19:37:14,2023-03-30 19:47:02,30,3,2023,2023-03-30,Thursday
429,CYCLING,93.84,291,HIGH,2023-03-30 19:55:04,2023-03-30 19:57:35,30,3,2023,2023-03-30,Thursday
430,CYCLING,95.16,1606,HIGH,2023-03-31 11:47:32,2023-03-31 11:57:20,31,3,2023,2023-03-31,Friday
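
The same selection can also be written against the timestamp column directly, which avoids depending on the separate `year` and `month` columns. The sketch below is one possible way to do it, using a small made-up DataFrame (`demo`) and a half-open date window; the variable names are illustrative only.

```python
import pandas as pd

# Made-up rows for illustration; only the first one should survive the filter
demo = pd.DataFrame({
    "Activity_Type": ["CYCLING", "CYCLING", "WALKING", "CYCLING"],
    "confidence": ["HIGH", "LOW", "HIGH", "MEDIUM"],
    "endTimestamp": pd.to_datetime([
        "2023-03-30 18:24:21",
        "2023-04-02 09:00:00",
        "2023-05-01 07:04:02",
        "2023-06-01 00:00:00",
    ]),
})

window_start = pd.Timestamp("2023-02-01")
window_end = pd.Timestamp("2023-06-01")  # exclusive upper bound

mask = (
    (demo["Activity_Type"] == "CYCLING")
    & demo["confidence"].isin(["MEDIUM", "HIGH"])
    & (demo["endTimestamp"] >= window_start)
    & (demo["endTimestamp"] < window_end)
)

filtered = demo.loc[mask].reset_index(drop=True)
print(filtered)
```

A half-open window (`>=` start, `<` end) keeps everything up to the last moment of 31 May without having to spell out a time of day.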


## Summary: Convert Distance from Meters to Kilometers

To make the distance values easier to interpret and analyze, we convert the `distance` column from **meters** to **kilometers** by dividing it by 1000.

In [189]:
df['distance'] = df['distance']/1000
df

Unnamed: 0,Activity_Type,Activity_Type_probability,distance,confidence,startTimestamp,endTimestamp,day,month,year,date,day_name
0,CYCLING,87.31,2.42,HIGH,2023-05-01 06:42:25,2023-05-01 07:04:02,1,5,2023,2023-05-01,Monday
1,CYCLING,97.04,0.95,HIGH,2023-05-01 13:41:49,2023-05-01 13:51:20,1,5,2023,2023-05-01,Monday
2,CYCLING,99.27,5.60,HIGH,2023-05-01 13:59:43,2023-05-01 14:30:28,1,5,2023,2023-05-01,Monday
3,CYCLING,99.05,5.40,HIGH,2023-05-01 15:40:18,2023-05-01 16:15:45,1,5,2023,2023-05-01,Monday
4,CYCLING,96.38,1.01,HIGH,2023-05-01 16:32:35,2023-05-01 16:41:59,1,5,2023,2023-05-01,Monday
...,...,...,...,...,...,...,...,...,...,...,...
427,CYCLING,97.60,1.12,HIGH,2023-03-30 18:11:15,2023-03-30 18:24:21,30,3,2023,2023-03-30,Thursday
428,CYCLING,96.41,1.14,HIGH,2023-03-30 19:37:14,2023-03-30 19:47:02,30,3,2023,2023-03-30,Thursday
429,CYCLING,93.84,0.29,HIGH,2023-03-30 19:55:04,2023-03-30 19:57:35,30,3,2023,2023-03-30,Thursday
430,CYCLING,95.16,1.61,HIGH,2023-03-31 11:47:32,2023-03-31 11:57:20,31,3,2023,2023-03-31,Friday


## 📈 Summary: Aggregating Daily Cycling Distance

This section summarizes your cleaned cycling activity data by calculating the **total distance cycled per day of the week**.

### ✅ Steps Performed:

1. **Aggregate** the `distance` column by grouping on `day_name` and using `.sum()` to get the total distance per weekday.
2. **Normalize** by dividing each weekday total by `17`, the approximate number of weeks between February and May 2023 (120 days is about 17.1 weeks), to compute an **average distance per weekday**.


In [202]:
weekday_analysis = df.groupby('day_name')['distance'].sum().reset_index()
weekday_analysis

Unnamed: 0,day_name,distance
0,Friday,197.33
1,Monday,147.61
2,Saturday,96.17
3,Sunday,89.17
4,Thursday,118.47
5,Tuesday,135.96
6,Wednesday,173.05
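
`groupby` sorts string keys alphabetically, which is why Friday appears first in the table above. If calendar order is preferred, one option is to turn `day_name` into an ordered categorical before sorting. The sketch below rebuilds the totals from the output above in a stand-alone DataFrame (`totals`, a name chosen for this example).

```python
import pandas as pd

weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Weekday totals in km, copied from the output above
totals = pd.DataFrame({
    "day_name": ["Friday", "Monday", "Saturday", "Sunday",
                 "Thursday", "Tuesday", "Wednesday"],
    "distance": [197.33, 147.61, 96.17, 89.17, 118.47, 135.96, 173.05],
})

# An ordered categorical makes sort_values follow calendar order
totals["day_name"] = pd.Categorical(
    totals["day_name"], categories=weekday_order, ordered=True
)
totals = totals.sort_values("day_name").reset_index(drop=True)
print(totals)
```

The same ordering carries over to most plotting libraries, so bar charts come out Monday to Sunday without extra work.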


In [204]:
weekday_analysis['avg_day_distance'] = weekday_analysis['distance']/17

In [206]:
weekday_analysis

Unnamed: 0,day_name,distance,avg_day_distance
0,Friday,197.33,11.61
1,Monday,147.61,8.68
2,Saturday,96.17,5.66
3,Sunday,89.17,5.25
4,Thursday,118.47,6.97
5,Tuesday,135.96,8.0
6,Wednesday,173.05,10.18
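
Dividing every weekday by the same constant is an approximation, because not every weekday occurs equally often in the window. As a sketch, and assuming the window runs over every calendar day from 1 February to 31 May 2023, the exact number of occurrences can be counted with `pd.date_range` and used as the divisor instead. The names `window`, `occurrences`, and `avg_per_weekday` are illustrative.

```python
import pandas as pd

# Assumed analysis window: every calendar day from 1 Feb to 31 May 2023
window = pd.date_range("2023-02-01", "2023-05-31", freq="D")

# How many times each weekday occurs in that window
occurrences = pd.Series(window.day_name()).value_counts()

# Weekday totals in km, copied from the output above
weekday_totals = pd.Series({
    "Friday": 197.33, "Monday": 147.61, "Saturday": 96.17, "Sunday": 89.17,
    "Thursday": 118.47, "Tuesday": 135.96, "Wednesday": 173.05,
})

# Division aligns on the weekday names in both indexes
avg_per_weekday = (weekday_totals / occurrences).round(2)

print(occurrences)
print(avg_per_weekday)
```

Under that assumption Wednesday occurs 18 times and every other weekday 17 times, so only the Wednesday average differs from the constant-17 calculation.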


In [218]:
# Total distance (km) divided by 116, the hard-coded number of days used for this period
print('The average distance biked on any given day over this period was', (weekday_analysis['distance'].sum()/116).round(2))

The average distance biked on any given day over this period was 8.26
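
The daily average depends heavily on which days are counted in the denominator: only days with at least one ride, or every calendar day in the window. The sketch below shows both on a tiny made-up dataset (`trips`). It assumes the `date` column holds pandas datetimes rather than the `dt.date` objects created earlier, so that it can be reindexed against a `pd.date_range`.

```python
import pandas as pd

# Made-up trips: two rides on 1 May, one on 3 May, none on 2 or 4 May
trips = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-01", "2023-05-01", "2023-05-03"]),
    "distance": [2.0, 1.0, 3.0],  # km
})

# Total distance per day that has at least one ride
daily = trips.groupby("date")["distance"].sum()

# Extend to every calendar day in the window, filling ride-free days with 0
full_range = pd.date_range("2023-05-01", "2023-05-04", freq="D")
daily_all = daily.reindex(full_range, fill_value=0.0)

avg_active_days = daily.mean()        # average over days with a ride
avg_calendar_days = daily_all.mean()  # average over all days in the window

print(avg_active_days, avg_calendar_days)
```

Deriving the denominator from the data in this way removes the need for a hard-coded day count and makes it explicit which definition of "any given day" is being reported.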
