# Extract Snowboard Data

This notebook is the first step in a project to extract and visualize GPS and biometric data collected while snowboarding. It extracts GPS data recorded by the Slopes app and heart rate data exported from Fitbit for a Pixel Watch 2. The data is processed in two steps:

1. Python to extract the data
2. SAS for data merging and final data transformation

### Input Data
- __GPS__ - GPS data in .gpx format and GPS metadata in .slopes format
- __Heart Rate__ - Heart rate data in .json format exported from Fitbit

### Output Data
All output is written to the `../stage` folder.

- __gps.parquet__ - GPS data
- __hr.parquet__ - Heart rate data
- __gps_meta.parquet__ - GPS metadata

### Folder Structure

To run this notebook successfully, it must be in a location with this folder structure:

```
[root]
   |
   |-----[stage]
   |-----[snowboarding]
               |--------- extract_snowboard_data.ipynb
               |--------- [data]
                            |----- [gps]
                            |----- [hr]
```

In [20]:
# Standard library
import json
import os
import xml.etree.ElementTree as ET
from zipfile import ZipFile

# Third-party (the unused `requests` import has been dropped)
import pandas as pd
from dateutil import parser

## Set Raw Data Location

The notebook expects the input data in three folders under the current working directory:
- `/data` - Input data location
- `/data/hr` - Heart rate data location
- `/data/gps` - GPS data location

In [21]:
data_loc = os.path.join(os.getcwd(), 'data')
hr_data  = os.path.join(data_loc, 'hr')
gps_data = os.path.join(data_loc, 'gps')
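
A quick sanity check can catch a wrong working directory before any file is read. The sketch below is not part of the original notebook: the helper name `missing_input_folders` is hypothetical, and it assumes only the three folders listed above.

```python
import os

def missing_input_folders(base_dir):
    """Return the expected input folders that do not exist under base_dir."""
    expected = [
        os.path.join(base_dir, 'data'),
        os.path.join(base_dir, 'data', 'hr'),
        os.path.join(base_dir, 'data', 'gps'),
    ]
    return [path for path in expected if not os.path.isdir(path)]

# Report anything missing relative to the current working directory.
missing = missing_input_folders(os.getcwd())
if missing:
    print('Missing input folders:', missing)
```

Returning the missing paths rather than raising keeps the check easy to reuse, for example to create the folders or to stop the notebook with a clearer message.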

## Read Heart Rate Data

Heart rate data is in JSON format. Variables include:
- `dateTime` - Date and time of the capture in UTC
- `bpm` - Beats Per Minute
- `confidence` - Confidence in accuracy of the reading. 0 = no reading could be made, 3 = highest quality signal.

Example of three points:

```
[{
  "dateTime" : "01/25/24 07:00:03",
  "value" : {
    "bpm" : 77,
    "confidence" : 2
  }
},
{
  "dateTime" : "01/25/24 07:00:08",
  "value" : {
    "bpm" : 73,
    "confidence" : 3
  }
},
{
  "dateTime" : "01/25/24 07:00:13",
  "value" : {
    "bpm" : 72,
    "confidence" : 2
  }
}]
```
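
To make the nesting concrete, here is a minimal standard-library sketch that flattens two of the sample records above into flat rows and parses `dateTime` with the `%m/%d/%y %H:%M:%S` format. The helper name `flatten_hr` and the output key `timestamp` are illustrative choices, not Fitbit field names.

```python
import json
from datetime import datetime

# Two of the sample records shown above.
sample = '''
[{"dateTime": "01/25/24 07:00:03", "value": {"bpm": 77, "confidence": 2}},
 {"dateTime": "01/25/24 07:00:08", "value": {"bpm": 73, "confidence": 3}}]
'''

def flatten_hr(records):
    """Lift the nested `value` fields to the top level of each record."""
    rows = []
    for rec in records:
        rows.append({
            # Two-digit year, month first: "01/25/24" is 25 January 2024.
            'timestamp': datetime.strptime(rec['dateTime'], '%m/%d/%y %H:%M:%S'),
            'bpm': rec['value']['bpm'],
            'confidence': rec['value']['confidence'],
        })
    return rows

rows = flatten_hr(json.loads(sample))
print(rows[0]['timestamp'], rows[0]['bpm'])  # 2024-01-25 07:00:03 77
```

The parsed timestamps are naive; per the field description above, they represent UTC.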

In [22]:
df_list  = []
    
for json_file in os.listdir(hr_data):
    with open(os.path.join(hr_data, json_file)) as f:
        data = json.load(f)
        
    df = pd.json_normalize(data, sep='_')
    df.columns = df.columns.str.lower().str.replace('value_', '')
    df['datetime'] = ( pd.to_datetime(df['datetime'], format='%m/%d/%y %H:%M:%S', utc=True)
                         .dt.tz_localize(None)
                     )
    df_list.append(df)
        
df_hr = (
    pd.concat(df_list, ignore_index=True)
      .drop_duplicates(subset=['datetime'], ignore_index=True)
      .rename(columns={'datetime':'timestamp'})
      .sort_values('timestamp')
      .reset_index(drop=True)
)

## Read GPS Data

The GPS data is in GPX format. There are two namespaces we need to use:

1. The gpx namespace: http://www.topografix.com/GPX/1/1
2. The gte namespace: http://www.gpstrackeditor.com/xmlschemas/General/1

We'll parse this as standard XML using `ElementTree`.

Variables include:
- `name` - Name of the mountain
- `lat` - Latitude
- `lon` - Longitude
- `ele` - Elevation in meters
- `time` - Timestamp of point with offset
- `hdop` - Horizontal Dilution of Precision. Lower = better horizontal (lat/lon) accuracy.
- `vdop` - Vertical Dilution of Precision. Lower = better vertical (elevation) accuracy.
- `speed` - Speed in km/h. Part of extension in gte namespace.
- `azimuth` - Compass angle. Part of extension in gte namespace.

A sample of a single capture:

```
<trk>
  <name>Jan 25, 2024 - Keystone Resort</name>
  <trkseg>
    <trkpt lat="39.605675" lon="-105.941414">
      <ele>2856.891977</ele>
      <time>2024-01-25T09:13:52.453-07:00</time>
      <hdop>19</hdop>
      <vdop>4</vdop>
      <extensions>
        <gte:gps speed="1.317580" azimuth="212.300003"/>
      </extensions>
    </trkpt>
  </trkseg>
</trk>
```
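
The namespaces matter because `ElementTree` matches tags by their fully qualified name. The sketch below parses a trimmed copy of the sample point. The enclosing `<gpx>` root with its `xmlns` declarations is an assumption added so the fragment parses on its own, since the sample above omits it.

```python
import xml.etree.ElementTree as ET

GPX = '{http://www.topografix.com/GPX/1/1}'
GTE = '{http://www.gpstrackeditor.com/xmlschemas/General/1}'

# Assumed wrapper: a <gpx> root declaring both namespaces.
sample = '''<gpx xmlns="http://www.topografix.com/GPX/1/1"
     xmlns:gte="http://www.gpstrackeditor.com/xmlschemas/General/1">
  <trk>
    <trkseg>
      <trkpt lat="39.605675" lon="-105.941414">
        <ele>2856.891977</ele>
        <time>2024-01-25T09:13:52.453-07:00</time>
        <extensions>
          <gte:gps speed="1.317580" azimuth="212.300003"/>
        </extensions>
      </trkpt>
    </trkseg>
  </trk>
</gpx>'''

root = ET.fromstring(sample)

# An unqualified tag name does not match a namespaced element.
unqualified = root.find('.//trkpt')

# Qualified lookups: trkpt, ele and extensions are in the gpx namespace,
# while the gps extension element is in the gte namespace.
trkpt = root.find(f'.//{GPX}trkpt')
gps = trkpt.find(f'{GPX}extensions/{GTE}gps')

point = {
    'lat': float(trkpt.get('lat')),
    'lon': float(trkpt.get('lon')),
    'elevation': float(trkpt.find(f'{GPX}ele').text),
    'speed': float(gps.get('speed')),
    'azimuth': float(gps.get('azimuth')),
}
print(unqualified)  # None
print(point['lat'], point['speed'])
```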

In [23]:
gpx_namespace = '{http://www.topografix.com/GPX/1/1}'
gte_namespace = '{http://www.gpstrackeditor.com/xmlschemas/General/1}'
    
all_gps_data = []
file_list    = [file_name for file_name in os.listdir(gps_data) if file_name.endswith(".gpx")]
    
for gpx_file in file_list:
    with open(os.path.join(gps_data, gpx_file)) as f:
        root = ET.parse(f)
        
    for trkpt in root.findall(f'.//{gpx_namespace}trkpt'):

        time_elem = trkpt.find(f'{gpx_namespace}time')
        elev_elem = trkpt.find(f'{gpx_namespace}ele')
        gps_elem  = trkpt.find(f'.//{gpx_namespace}extensions/{gte_namespace}gps')

        row = {
                "timestamp": parser.parse(time_elem.text, ignoretz=True),
                "lat":       float(trkpt.get("lat")),
                "lon":       float(trkpt.get("lon")),
                "elevation": float(elev_elem.text),
                "speed":     float(gps_elem.get("speed")),
                "azimuth":   float(gps_elem.get("azimuth"))
              }
        
        all_gps_data.append(row)

df_gps = (
    pd.DataFrame(all_gps_data)
      .drop_duplicates(subset=['timestamp'], ignore_index=True)
      .sort_values('timestamp')
      .reset_index(drop=True)
)
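
One detail worth noting is how the timezone is handled. `parser.parse(..., ignoretz=True)` discards the offset and keeps the local wall-clock time, whereas the heart rate cell treats its timestamps as UTC before dropping the timezone. The sketch below shows the difference for the sample timestamp. It uses `datetime.fromisoformat` in place of `dateutil` purely for illustration; how the two clocks are reconciled downstream is not covered in this notebook.

```python
from datetime import datetime, timezone

stamp = datetime.fromisoformat('2024-01-25T09:13:52.453-07:00')

# Same effect as ignoretz=True: drop the offset, keep the local wall-clock time.
local_naive = stamp.replace(tzinfo=None)

# Alternative: convert to UTC first, then drop the timezone.
utc_naive = stamp.astimezone(timezone.utc).replace(tzinfo=None)

print(local_naive)              # 2024-01-25 09:13:52.453000
print(utc_naive)                # 2024-01-25 16:13:52.453000
print(utc_naive - local_naive)  # 7:00:00
```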

## Read GPS Metadata

GPS metadata is stored in a `.slopes` file, which is a ZIP archive. We read `Metadata.xml` directly from the archive. It contains many attributes; the main ones we want are:

- `start` - Start time of activity
- `end` - End time of activity
- `type` - Type of activity (Lift or Run)
- `numberOfType` - Which lift or run number the activity is for (e.g. first run, second lift, etc.)

An example of two GPS metadata points:

```
<Action start="2024-01-25 09:13:30 -0700" end="2024-01-25 09:24:29 -0700" type="Lift" numberOfType="1" …>
<Action start="2024-01-25 09:27:35 -0700" end="2024-01-25 09:32:36 -0700" type="Run" numberOfType="1" …>
```
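
As a rough illustration of the archive layout, the sketch below builds a stand-in `.slopes` file in memory and reads the `Action` attributes back using only the standard library. The `<Activity>` and `<actions>` wrapper elements are hypothetical, and the XML is trimmed to the four attributes listed above; a real `Metadata.xml` contains more.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime
from zipfile import ZipFile

# Hypothetical, trimmed-down Metadata.xml with only the attributes of interest.
metadata_xml = """<Activity>
  <actions>
    <Action start="2024-01-25 09:13:30 -0700" end="2024-01-25 09:24:29 -0700" type="Lift" numberOfType="1"/>
    <Action start="2024-01-25 09:27:35 -0700" end="2024-01-25 09:32:36 -0700" type="Run" numberOfType="1"/>
  </actions>
</Activity>"""

# Build a stand-in .slopes archive in memory.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zip_file:
    zip_file.writestr('Metadata.xml', metadata_xml)

def parse_local(text):
    """Parse 'YYYY-MM-DD HH:MM:SS +HHMM' and drop the offset, keeping local time."""
    return datetime.strptime(text, '%Y-%m-%d %H:%M:%S %z').replace(tzinfo=None)

# Read Metadata.xml straight from the archive without extracting it to disk.
with ZipFile(buffer, 'r') as zip_file:
    with zip_file.open('Metadata.xml') as xml_file:
        root = ET.parse(xml_file).getroot()

actions = [
    {
        'start': parse_local(a.get('start')),
        'end': parse_local(a.get('end')),
        'type': a.get('type'),
        'numberOfType': int(a.get('numberOfType')),
    }
    for a in root.iter('Action')
]

for action in actions:
    print(action['type'], action['numberOfType'], action['end'] - action['start'])
```

With the two sample actions, this prints a duration of `0:10:59` for the first lift and `0:05:01` for the first run.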

In [24]:
df_list   = []
file_list = [file_name for file_name in os.listdir(gps_data) if file_name.endswith(".slopes")]
    
# .slopes files are just zip files with some CSVs and XML metadata.
# We just want to read Metadata.xml
for slopes_file in file_list:
    with ZipFile(os.path.join(gps_data, slopes_file), 'r') as zip_file:
        with zip_file.open('Metadata.xml') as xml_file:
            df = pd.read_xml(xml_file, parser='etree', xpath='.//Action')
            
    # Convert start/end to datetimes without the timezone.
    # DataFrame.map requires pandas 2.1+; on older versions use applymap instead.
    df[['start', 'end']] = df[['start', 'end']].map(lambda x: parser.parse(x, ignoretz=True))
    df_list.append(df)
        
df_gps_meta = (
    pd.concat(df_list, ignore_index=True)
      .sort_values('start')
      .reset_index(drop=True)
)

## Output

Save all extracted data to the `../stage` folder in Parquet format for use in the next step, `transform_snowboard_data.sas`.

In [25]:
# Set this folder to your preferred output location
out = '../stage'

df_hr.to_parquet(os.path.join(out, 'hr.parquet'))
df_gps.to_parquet(os.path.join(out, 'gps.parquet'))
df_gps_meta.to_parquet(os.path.join(out, 'gps_meta.parquet'))