In [3]:
import json
import datetime
import pandas as pd
from geopy import distance #will likely have to be installed separately

I started by finding the latitude and longitude boundaries of campus. I will draw a circular boundary from the midpoint of campus that should encompass all parts of campus with a bit of leeway. Even though this will extend slightly outside the range of campus it should be close enough since most of the extra area included is woods or streets I don't spend time on, so it shouldn't add extra data that isn't relevant. I used a website to help me locate where this circle is and its radius. I found a midpoint at: 38.990262, -76.943013 with a radium of 986.61 meters.

I started by finding the latitude and longitude boundaries campus, and the surrounding areas in College Park (basically from the apartments off of South Campus up north to where North Campus ends. And then far East enough to reach my home, and encompassing campas in the West). Using google maps I found that this area stretches over latitudes: 38.977086 to 39.000676 and it stretches over longitudes: -76.925135 to -76.953067. Since I plan to calculate travel times only in the campus area, I want to use only travel segments that start and end within these latitude and longitude ranges. We'll have to pick out these lat/long data from our JSON data that was downloaded from Google. <br>
Using a JSON tree plugin for Notepad++ I got a good idea of how the data is organized. The top level of JSON is a single object named "timelineObjects." This contains an array of all the data collected by Google as a sequence of numbers. Each of these objects can be either a "placeVisit" or an "activitySegment." These two are basically interwoven with each other since each placeVisit is a location that a user ended a trip in, and each activitySegment represents a distance traveled by a user. <br>
Each of these objects have their own elements. activitySegment has: startLocation, endLocation, duration, distance, activityType, confidence, activities, waypointPath, and simplifiedRawPath. Most of these are self-explanatory, but the remainder are actually simple as well. activity is the method of transportation (cycling, in_passenger_vehicle, train, etc.), confidence is how confident Google is that this method of transportation predicted is correct (can be HIGH, MEDIUM, etc.). activities contains the list of possible activities and the probability that Google thinks you were conducting this specific activity (essentially a further breakdown of 'activity'). waypointPath is the lat/long coordinates of each waypoint google has marked along your journey. simplifiedRawPath seems to be just waypointPath with less waypoints, but I'm not entirely sure what it represents. <br>
placeVisit has elements: location, duration, placeConfidence, childVisits, visitConfidence, otherCandidateLocations, placeVisitType, and a few others. These are mostly self explanatory as well. placeConfidence is how likely Google thinks that the location given is correct (I assume this is done since GPS isn't hyper-accurate, so they try to predict exact location). childVisits is more local locations that may have been visited (like individual shops within a mall). placeVisitType is a value like SINGLE_PLACE. <br>
Many of these attributes for activitySegment will be useful like startLocation, endLocation, duration, distance, acitvityType, and confidence. For placeVisit, I don't expect to need this data, however, it may turn out to be useful for determining which activitySegments ended on campus by checking of the placeVisit Location is a building on campus or campus itself 

In [134]:
# create a list of all files using loops cuz I am lazy
months = ["JANUARY","FEBRUARY","MARCH","APRIL","MAY","JUNE","JULY","AUGUST","SEPTEMBER","OCTOBER","NOVEMBER","DECEMBER"]
years = ["2019_","2020_","2021_","2022_"]
fileList = []
for year in years:
    for month in months:
        fileList.append(f"{year}{month}")
# get rid of the extra files that account for months that are pre-August 2019, and post-May 2022    
fileList = fileList[7:41]
        
# Create a dataframe, and then append each file's data onto it by looping through them
# This takes a good 5 seconds on my PC
timelineData = pd.DataFrame()
for file in fileList:
    # Load the data in with loads() to be ready to convert to a dataframe
    with open(json_file,'r') as f:
        data_temp = json.loads(f.read())
    # Then use pd.json_normalize to flatten the data since it was in nested JSON object
    # record_path tells it where to look at data to normalize
    df_temp = pd.json_normalize(data, record_path = ['timelineObjects'])
    timelineData = pd.concat([timelineData, df_temp], ignore_index = True)
timelineData

Unnamed: 0,activitySegment.startLocation.latitudeE7,activitySegment.startLocation.longitudeE7,activitySegment.startLocation.sourceInfo.deviceTag,activitySegment.endLocation.latitudeE7,activitySegment.endLocation.longitudeE7,activitySegment.endLocation.sourceInfo.deviceTag,activitySegment.duration.startTimestamp,activitySegment.duration.endTimestamp,activitySegment.distance,activitySegment.activityType,...,activitySegment.simplifiedRawPath.source,activitySegment.simplifiedRawPath.distanceMeters,activitySegment.transitPath.transitStops,activitySegment.transitPath.name,activitySegment.transitPath.hexRgbColor,activitySegment.transitPath.linePlaceId,activitySegment.transitPath.stopTimesInfo,activitySegment.transitPath.source,activitySegment.transitPath.confidence,activitySegment.transitPath.distanceMeters
0,389961925.0,-769280436.0,-1.787875e+09,389851415.0,-769475366.0,-1.787875e+09,2022-04-01T14:51:59.318Z,2022-04-01T15:07:35.066Z,2133.0,CYCLING,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,389860599.0,-769469002.0,-1.787875e+09,389960539.0,-769284190.0,-1.787875e+09,2022-04-01T15:46:53.601Z,2022-04-01T15:54:39.485Z,2062.0,CYCLING,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,389959961.0,-769283521.0,-1.787875e+09,389961894.0,-769285198.0,-1.787875e+09,2022-04-01T16:41:48.852Z,2022-04-01T17:01:58.680Z,5889.0,IN_PASSENGER_VEHICLE,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10093,,,,,,,,,,,...,,,,,,,,,,
10094,389868494.0,-769415102.0,-1.787875e+09,389804719.0,-769399685.0,-1.787875e+09,2022-04-30T19:52:22.669Z,2022-04-30T19:59:44.499Z,756.0,WALKING,...,,,,,,,,,,
10095,,,,,,,,,,,...,,,,,,,,,,
10096,389817136.0,-769404169.0,-1.787875e+09,389961196.0,-769283715.0,-1.787875e+09,2022-04-30T23:45:48.903Z,2022-04-30T23:52:56.976Z,2035.0,IN_PASSENGER_VEHICLE,...,,,,,,,,,,


This data obviously needs a lot of clean-up. I saw some libraries like Glom that could've been used to extract data more nicely, but since I only need a few attributes it's simple enough to just clean up the data after putting it in the dataframe.

In [135]:
# Start by dropping all of the placeVisits by getting rid of rows that don't have data in activitySegment columns
timelineData.dropna(subset=['activitySegment.startLocation.latitudeE7'],inplace=True)

# Next include only the columns we actually care about
timelineDataClean = timelineData[timelineData.columns[[0,1,3,4,6,7,8,9,10]]]

# Finally let's rename these columns
timelineDataClean.set_axis(['startLat', 'startLng', 'endLat', 'endLng', 'startTime', 'endTime', 'dist', 'activity', 'confidence'], axis=1, inplace=True)
timelineDataClean

Unnamed: 0,startLat,startLng,endLat,endLng,startTime,endTime,dist,activity,confidence
0,389961925.0,-769280436.0,389851415.0,-769475366.0,2022-04-01T14:51:59.318Z,2022-04-01T15:07:35.066Z,2133.0,CYCLING,HIGH
2,389860599.0,-769469002.0,389960539.0,-769284190.0,2022-04-01T15:46:53.601Z,2022-04-01T15:54:39.485Z,2062.0,CYCLING,HIGH
4,389959961.0,-769283521.0,389961894.0,-769285198.0,2022-04-01T16:41:48.852Z,2022-04-01T17:01:58.680Z,5889.0,IN_PASSENGER_VEHICLE,MEDIUM
6,389962056.0,-769287358.0,389881245.0,-769398750.0,2022-04-01T17:56:50.999Z,2022-04-01T18:01:47.945Z,1429.0,CYCLING,MEDIUM
8,389885651.0,-769401921.0,389960442.0,-769284344.0,2022-04-01T18:41:11.296Z,2022-04-01T18:52:59.762Z,1313.0,CYCLING,MEDIUM
...,...,...,...,...,...,...,...,...,...
10088,389828505.0,-769480967.0,389839017.0,-769485749.0,2022-04-30T00:49:22.391Z,2022-04-30T00:50:23.606Z,134.0,WALKING,HIGH
10090,389855996.0,-769489810.0,389959959.0,-769284153.0,2022-04-30T04:03:03.492Z,2022-04-30T04:08:46.399Z,2757.0,IN_PASSENGER_VEHICLE,HIGH
10092,389961471.0,-769287037.0,389867376.0,-769413156.0,2022-04-30T13:22:57Z,2022-04-30T13:27:44.879Z,1729.0,CYCLING,HIGH
10094,389868494.0,-769415102.0,389804719.0,-769399685.0,2022-04-30T19:52:22.669Z,2022-04-30T19:59:44.499Z,756.0,WALKING,LOW


Now that we've got our data cleaned up, we can start filtering it to ensure that we're only including data that takes place around UMD campus. In addition I'm going to drop any LOW confidence activities since Google doesn't display these in my account's timeline history. I don't believe that dropping these datapoints should skew the data since I expect the low confidence is due to bad GPS data, not due to any factors I plan to analyze such as duration or distance. 

In [137]:
# First let's ensure we only look at start and end locations that are on the campus
# This means filtering by latitudes between 38.977086 to 39.000676 and longitudes between -76.925135 to -76.953067
# These coordinates are represented without a decimal point though, so in reality we need to multiply these numbers by 10^7

criterian = (timelineDataClean['startLat'] < 390000676) & (timelineDataClean['startLat'] > 389770860)
criterian = criterian & (timelineDataClean['endLat'] < 390000676) & (timelineDataClean['endLat'] > 389770860)
criterian = criterian & (timelineDataClean['startLng'] < -769251350) & (timelineDataClean['startLng'] > -769530670)
criterian = criterian & (timelineDataClean['endLng'] < -769251350) & (timelineDataClean['endLng'] > -769530670)

# We also only want activitySegments where the confidence level is at least MEDIUM
criterian = criterian & ( (timelineDataClean['confidence'] == "MEDIUM") | (timelineDataClean['confidence'] == "HIGH") )

finalData = timelineDataClean[criterian]
finalData
#timelineDataClean[[(x == "LOW") for x in timelineDataClean['confidence']]]



Unnamed: 0,startLat,startLng,endLat,endLng,startTime,endTime,dist,activity,confidence
0,389961925.0,-769280436.0,389851415.0,-769475366.0,2022-04-01T14:51:59.318Z,2022-04-01T15:07:35.066Z,2133.0,CYCLING,HIGH
2,389860599.0,-769469002.0,389960539.0,-769284190.0,2022-04-01T15:46:53.601Z,2022-04-01T15:54:39.485Z,2062.0,CYCLING,HIGH
4,389959961.0,-769283521.0,389961894.0,-769285198.0,2022-04-01T16:41:48.852Z,2022-04-01T17:01:58.680Z,5889.0,IN_PASSENGER_VEHICLE,MEDIUM
6,389962056.0,-769287358.0,389881245.0,-769398750.0,2022-04-01T17:56:50.999Z,2022-04-01T18:01:47.945Z,1429.0,CYCLING,MEDIUM
8,389885651.0,-769401921.0,389960442.0,-769284344.0,2022-04-01T18:41:11.296Z,2022-04-01T18:52:59.762Z,1313.0,CYCLING,MEDIUM
...,...,...,...,...,...,...,...,...,...
10084,389867763.0,-769412799.0,389960635.0,-769284629.0,2022-04-29T23:17:29.962Z,2022-04-29T23:22:57.123Z,1653.0,IN_PASSENGER_VEHICLE,MEDIUM
10088,389828505.0,-769480967.0,389839017.0,-769485749.0,2022-04-30T00:49:22.391Z,2022-04-30T00:50:23.606Z,134.0,WALKING,HIGH
10090,389855996.0,-769489810.0,389959959.0,-769284153.0,2022-04-30T04:03:03.492Z,2022-04-30T04:08:46.399Z,2757.0,IN_PASSENGER_VEHICLE,HIGH
10092,389961471.0,-769287037.0,389867376.0,-769413156.0,2022-04-30T13:22:57Z,2022-04-30T13:27:44.879Z,1729.0,CYCLING,HIGH


In [140]:
cycling = finalData[finalData['activity'] == "CYCLING"]
driving = finalData[finalData['activity'] == "IN_PASSENGER_VEHICLE"]
walking = finalData[finalData['activity'] == "WALKING"]
print(cycling.count())
print(driving.count())
print(walking.count())

startLat      1768
startLng      1768
endLat        1768
endLng        1768
startTime     1768
endTime       1768
dist          1768
activity      1768
confidence    1768
dtype: int64
startLat      1156
startLng      1156
endLat        1156
endLng        1156
startTime     1156
endTime       1156
dist          1156
activity      1156
confidence    1156
dtype: int64
startLat      136
startLng      136
endLat        136
endLng        136
startTime     136
endTime       136
dist          136
activity      136
confidence    136
dtype: int64


break cell:

In [4]:
#all of the below taken from: https://medium.com/@stephen.hogg.sh/interrogating-googles-timeline-data-ca22c3a9fd2c
start_year, start_month, start_day = 2019, 8, 22
end_year, end_month, end_day = 2022, 5, 7
point_of_interest = (38.9869, -76.9426)  # Latitude, Longitude
distance_from_point_of_interest = 0.986 # km of distance
aggregate_by = 'Day'   # This can be 'Year', 'Month', 'Day', 'Hour', 'Second'
# weekday_exceptions = ['Saturday','Sunday'] # A list of weekdays that should excluded.

# Reformat the start/end datetimes to epoch times in milliseconds
start_datetime = datetime.datetime(start_year,start_month,start_day).timestamp()*1e3
end_datetime = datetime.datetime(end_year,end_month,end_day).timestamp()*1e3

# Setup the aggregate list
aggregate_options = ['Year','Month','Day','Hour','Second']
aggregate_list = aggregate_options[:aggregate_options.index(aggregate_by)+1]

In [6]:
# We'll start with just April 2022 data to test out the timeline data
json_file = "2022_APRIL.json"
data = json.load(open(json_file))   # Read the file
# locations = data['locations']

In [19]:
data

{'timelineObjects': [{'activitySegment': {'startLocation': {'latitudeE7': 389961925,
     'longitudeE7': -769280436,
     'sourceInfo': {'deviceTag': -1787875010}},
    'endLocation': {'latitudeE7': 389851415,
     'longitudeE7': -769475366,
     'sourceInfo': {'deviceTag': -1787875010}},
    'duration': {'startTimestamp': '2022-04-01T14:51:59.318Z',
     'endTimestamp': '2022-04-01T15:07:35.066Z'},
    'distance': 2133,
    'activityType': 'CYCLING',
    'confidence': 'HIGH',
    'activities': [{'activityType': 'CYCLING',
      'probability': 78.75603437423706},
     {'activityType': 'IN_PASSENGER_VEHICLE', 'probability': 9.31272879242897},
     {'activityType': 'WALKING', 'probability': 5.101209878921509},
     {'activityType': 'MOTORCYCLING', 'probability': 3.6667339503765106},
     {'activityType': 'STILL', 'probability': 1.1762714013457298},
     {'activityType': 'IN_BUS', 'probability': 1.0320750065147877},
     {'activityType': 'IN_TRAIN', 'probability': 0.40568551048636436},
  