# Transform

In this notebook we're going to transform GPX files stored in DigitalOcean Spaces. You may notice that this project is missing an extract step. That's because I have a process set up to upload Equilab data using a Shortcut on my phone, which is essentially my extract process. Equilab doesn't have an API, so I can't extract the data in the traditional sense.

Before we get to the fun stuff, we need to import a few libraries:

## Import

In [39]:
import bucketstore
from geopy import distance as geodistance
import geopy
import gpxpy
import yaml
import json


## Setup

With the imports done, we can take care of a few basics. The cell below:
 1) Loads a file containing connection info to my DO Space
 2) Uses the connection info to connect to the space
 3) Loads the bucket that all of the data is stored in.

In [40]:
with open("secrets.yml", 'r') as ymlfile:
    cfg = yaml.safe_load(ymlfile)

bucketstore.login(
    access_key_id=cfg['spaces']['access'],
    secret_access_key=cfg['spaces']['secret'],
    region='nyc3',
    endpoint_url=cfg['spaces']['url']
)
bucket = bucketstore.get('wrathalake')

ridesKey = 'intermediate/rides.json'
processed = False
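
For reference, here is a minimal sketch of the shape the setup cell expects from `secrets.yml` once it has been parsed. The key names (`spaces`, `access`, `secret`, `url`) come from the cell above; the helper `missing_spaces_keys`, the placeholder values, and the example endpoint URL are assumptions made for illustration only.

```python
# Hypothetical helper: checks that a parsed secrets file has the keys
# the setup cell reads. The values below are placeholders, not real credentials.
REQUIRED_SPACES_KEYS = ("access", "secret", "url")


def missing_spaces_keys(cfg):
    """Return the required keys absent from cfg['spaces'], in order."""
    spaces = cfg.get("spaces") or {}
    return [key for key in REQUIRED_SPACES_KEYS if key not in spaces]


# Assumed shape of the parsed YAML (a dict of dicts)
example_cfg = {
    "spaces": {
        "access": "PLACEHOLDER_ACCESS_KEY",
        "secret": "PLACEHOLDER_SECRET_KEY",
        "url": "https://nyc3.digitaloceanspaces.com",  # assumed endpoint
    }
}

print(missing_spaces_keys(example_cfg))               # []
print(missing_spaces_keys({"spaces": {"url": ""}}))   # ['access', 'secret']
```

Running a check like this before `bucketstore.login` would turn a confusing `KeyError` into a clearer message about what the secrets file is missing.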

## Functions

To keep the code clean, we're going to encapsulate the per-file processing in a function.

In [41]:
def processFile(key):
    gpx = gpxpy.parse(bucket[key])
    
    # Choosing to store coordinates in a dict to remove ambiguity. Apparently we can't all agree to use long, lat or lat, long
    data = [{
        'time': point.time,
        'coords': {
            'long': point.longitude,
            'lat': point.latitude
        },
        'elevation': point.elevation * 3.28084  # GPX elevation is in meters; convert to feet
    } for point in gpx.tracks[0].segments[0].points]
    
    # Calculating some ride metrics
    totalDistance = 0
    totalTime = (data[-1]['time'] - data[0]['time']).total_seconds()
    totalClimb = 0
    
    prv = None
    
    for index, obj in enumerate(data):
        obj['index'] = index
        
        # Look back one point so the deltas below can be computed against it
        if index > 0:
            prv = data[index - 1]

        if prv is not None:
            timeDelta = (obj['time'] - prv['time']).total_seconds()
            
            # geopy uses coordinates in the (lat, long) format, so we'll create that below.
            distance = geodistance.geodesic((obj['coords']['lat'], obj['coords']['long']), (prv['coords']['lat'], prv['coords']['long'])).ft
            totalDistance += distance
            
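            # Convert ft/s to mph (3600 / 5280); two points with the same timestamp get no speed
            speed = (distance / timeDelta) * 0.681818182 if timeDelta > 0 else None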
            
            if obj['elevation'] > prv['elevation']:
                climb = obj['elevation'] - prv['elevation']
                totalClimb += climb
            else:
                climb = None

            obj['timeDelta'] = timeDelta
            obj['distance'] = distance
            obj['speed'] = speed
            obj['climb'] = climb
            obj['drop'] = (prv['elevation'] - obj['elevation']) if obj['elevation'] < prv['elevation'] else None
            
        
    rideData = {
        'rawFile': key,
        'interFile': key.replace('raw', 'intermediate').replace('gpx', 'json'),
        'rideDate': data[0]['time'].strftime("%Y%m%d"),
        'totalTime': totalTime,
        'totalDistance': totalDistance,
        'totalClimb': totalClimb,
        'averageSpeed': (totalDistance / totalTime) * 0.681818182,
        'inElastic': False
    }
    
    # Once we're done processing, convert each time to an ISO 8601 string so it can be stored in JSON
    for obj in data:
        obj['time'] = obj['time'].isoformat()
            
    return data, rideData
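
The two constants in `processFile` are unit conversions: `3.28084` turns meters into feet, and `0.681818182` turns feet per second into miles per hour. The sketch below shows where the second number comes from and reproduces the climb/drop bookkeeping on a handful of made-up elevations. The helper `climb_and_drop` and the sample values are illustrative assumptions, not part of the notebook.

```python
FEET_PER_METER = 3.28084          # same factor processFile applies to elevation
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

# ft/s -> mph: multiply by 3600 / 5280, which is the 0.681818182 used above
FPS_TO_MPH = SECONDS_PER_HOUR / FEET_PER_MILE


def climb_and_drop(elevations_ft):
    """Sum the upward and downward elevation changes between consecutive points."""
    climb = drop = 0.0
    for prev, cur in zip(elevations_ft, elevations_ft[1:]):
        delta = cur - prev
        if delta > 0:
            climb += delta
        elif delta < 0:
            drop += -delta
    return climb, drop


# Made-up elevations in meters, converted to feet the same way as above
elevations_ft = [m * FEET_PER_METER for m in (100.0, 101.0, 100.5, 102.0)]
total_climb, total_drop = climb_and_drop(elevations_ft)

print(round(FPS_TO_MPH, 9))                          # 0.681818182
print(round(total_climb, 3), round(total_drop, 3))   # 8.202 1.64
```

Because every small upward step is added to the total, noisy GPS elevation readings can inflate the climb figure considerably.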


## History

Whenever this process runs, it uses a rides file that documents each ride. If a ride is listed in this file, it has already been processed and doesn't need to be processed again.

So first things first, let's load the rides file.

In [42]:
try:
    rides = json.loads(bucket[ridesKey])
except Exception:
    # No rides file yet (or it couldn't be read), so start with an empty one
    bucket[ridesKey] = json.dumps({})
    rides = {}

rideKeys = rides.keys()

print('There are {} rides that have been processed.'.format(len(rides.keys())))

There are 9 rides that have been processed.
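
The load-or-create pattern above can be tried out without a real bucket. This is a minimal sketch that uses a plain dict as a stand-in; `load_rides` and `fake_bucket` are hypothetical names, and catching `KeyError` is an assumption that holds for a dict but may not match what a real bucket client raises.

```python
import json


def load_rides(store, key):
    """Load the rides index from store, creating an empty one if the key is absent."""
    try:
        return json.loads(store[key])
    except KeyError:
        # A plain dict raises KeyError for a missing key; a real bucket
        # client may raise something else, so adjust this to match.
        store[key] = json.dumps({})
        return {}


fake_bucket = {}  # stand-in for the bucket, for illustration only
rides = load_rides(fake_bucket, 'intermediate/rides.json')

print(rides)        # {}
print(fake_bucket)  # {'intermediate/rides.json': '{}'}
```

Catching only the "missing key" error means that a network failure or a corrupt file surfaces as an error instead of silently resetting the index.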


## Raw Ride Files

Now that we know which files have already been processed, we can check every file against that list and process only the ones that need it.

In [43]:
objects = bucket.list(prefix='raw/sources/equilab/')

print('There are {} rides total.'.format(len(objects)))

There are 12 rides total.


In [44]:
for obj in objects:
    if obj.endswith('.gpx') and obj not in rideKeys:
        print('Processing {}'.format(obj))
        fileData, rideData = processFile(obj)
        
        newKey = obj.replace('raw', 'intermediate').replace('gpx', 'json')
        bucket[newKey] = json.dumps(fileData)
        
        rides[obj] = rideData
        processed = True

# If we've processed any data, then we need to save the rides data back to the bucket.
if processed:
    bucket[ridesKey] = json.dumps(rides)
else:
    print('Looks like no data was processed')


Processing raw/sources/equilab/training-2023-08-12.gpx
Processing raw/sources/equilab/training-2023-08-13.gpx
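
The selection rule in the loop above (keep `.gpx` keys that aren't already in the rides file) can be isolated into a small function, which makes it easy to test without touching the bucket. This is only a sketch: `pending_rides` is a hypothetical helper, and the listing, including `notes.txt`, is made up.

```python
def pending_rides(all_keys, processed_keys):
    """Return the .gpx keys that are not yet recorded as processed."""
    done = set(processed_keys)
    return [k for k in all_keys if k.endswith('.gpx') and k not in done]


# Hypothetical listing: one ride already processed, one new, one non-GPX object
listing = [
    'raw/sources/equilab/training-2023-08-12.gpx',
    'raw/sources/equilab/training-2023-08-13.gpx',
    'raw/sources/equilab/notes.txt',
]
already_done = {'raw/sources/equilab/training-2023-08-12.gpx': {}}

print(pending_rides(listing, already_done))
# ['raw/sources/equilab/training-2023-08-13.gpx']
```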


## Helper Functions

The lines of code below are helpers and should generally be left commented out. That said, it isn't always a bad thing to leave some of them uncommented so that their output is saved as history by the scheduled process that runs the notebook.


In [33]:
# View the rides JSON file
# bucket[ridesKey]


In [34]:
# Delete the rides file, which essentially will restart processing
# del bucket[ridesKey]

In [35]:
# List all of the ride data in the intermediate directory
# bucket.list(prefix='intermediate/sources/equilab/')


In [36]:
# Delete all of the processed rides
# for key in bucket.list(prefix='intermediate/sources/equilab/'):
#     del bucket[key]


In [45]:
# View a few lines from the last fileData processed
if processed:
    fileData[:3]


[{'time': '2023-08-13T20:34:08+00:00',
  'coords': {'long': -85.3560152, 'lat': 38.2852281},
  'elevation': 833.005276,
  'index': 0},
 {'time': '2023-08-13T20:34:09+00:00',
  'coords': {'long': -85.3560147, 'lat': 38.2852272},
  'elevation': 833.989528,
  'index': 1,
  'timeDelta': 1.0,
  'distance': 0.35780751633480307,
  'speed': 0.24395967029333074,
  'climb': 0.9842519999999695,
  'drop': None},
 {'time': '2023-08-13T20:34:10+00:00',
  'coords': {'long': -85.3560108, 'lat': 38.2852248},
  'elevation': 833.33336,
  'index': 2,
  'timeDelta': 1.0,
  'distance': 1.4202644144790146,
  'speed': 0.9683621010393763,
  'climb': None,
  'drop': 0.6561679999999797}]
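
As a rough cross-check on the `distance` values above, the gap between the first two points can be recomputed with the haversine formula using only the standard library. Note that this is a spherical approximation, not the ellipsoidal geodesic that geopy computes, so the result should be close to the `0.3578` ft shown but not identical. The function name `haversine_ft` and the choice of mean Earth radius are assumptions of this sketch.

```python
import math

EARTH_RADIUS_M = 6371008.8   # mean Earth radius; spherical approximation
FEET_PER_METER = 3.28084


def haversine_ft(lat1, lon1, lat2, lon2):
    """Great-circle distance in feet between two (lat, long) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    meters = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
    return meters * FEET_PER_METER


# Points with index 0 and 1 from the output above, in (lat, long) order
d = haversine_ft(38.2852281, -85.3560152, 38.2852272, -85.3560147)
print(d)  # roughly a third of a foot
```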

In [47]:
# Print the last five rides that have been processed
if processed:
    list(rides.keys())[-5:]


['raw/sources/equilab/training-2023-11-05.gpx',
 'raw/sources/equilab/training-2023-11-18.gpx',
 'raw/sources/equilab/training-2023-11-19.gpx',
 'raw/sources/equilab/training-2023-08-12.gpx',
 'raw/sources/equilab/training-2023-08-13.gpx']

In [48]:
# Print the ride details for one of the rides above; by default it shows the latest
if processed:
    rides[list(rides.keys())[-1]]


{'rawFile': 'raw/sources/equilab/training-2023-08-13.gpx',
 'interFile': 'intermediate/sources/equilab/training-2023-08-13.json',
 'rideDate': '20230813',
 'totalTime': 6250.003,
 'totalDistance': 29337.12579091715,
 'totalClimb': 6869.750875999966,
 'averageSpeed': 3.200412187301101,
 'inElastic': False}
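
The `interFile` value above is derived from `rawFile` with two chained `str.replace` calls, which rewrite every occurrence of `raw` and `gpx` in the key. That works for the file names shown here. If a stricter mapping were ever needed, one possible approach is to change only the leading directory and the extension. The helper `intermediate_key` and the `gpx-test.gpx` file name are hypothetical.

```python
from pathlib import PurePosixPath


def intermediate_key(raw_key):
    """Map 'raw/.../name.gpx' to 'intermediate/.../name.json'."""
    path = PurePosixPath(raw_key)
    if path.parts[0] != 'raw' or path.suffix != '.gpx':
        raise ValueError('expected a raw/*.gpx key, got {!r}'.format(raw_key))
    return str(PurePosixPath('intermediate', *path.parts[1:]).with_suffix('.json'))


print(intermediate_key('raw/sources/equilab/training-2023-08-13.gpx'))
# intermediate/sources/equilab/training-2023-08-13.json

# The chained replace calls also alter a file name that contains "gpx":
tricky = 'raw/sources/equilab/gpx-test.gpx'
print(tricky.replace('raw', 'intermediate').replace('gpx', 'json'))
# intermediate/sources/equilab/json-test.json
print(intermediate_key(tricky))
# intermediate/sources/equilab/gpx-test.json
```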