<span style='color:red'> NOTE: You can only pass the lab, when you provide both code and markdown </span>

Use Code for your analysis
Use Markdown to document and elaborate on your findings, conclusions, assertions, etc.

# Lab Work: Analyze Data With an Unknown File Format

This lab work should get you acquainted with the steps for analyzing data from unknown file formats. This happens often in practice, but in most cases, including ours, these formats are more widespread than one might think. This also means that there might be a ready-made Python package for working with the file format.

## 1. Check the file format

Use the file `unkown_file.gpx`.
Answer the following questions (justify each answer using either your own analysis with Python or some other source, which you have to cite properly):
Answer directly under the bullet point.
* What is a `.gpx` file?
* How is it organized?
* What potential Python support is available?
* Make an analysis: Which Python package would you choose?

### Answer 1:
A GPX file, also known as a GPS Exchange Format file, is simply a text file with geographic information such as waypoints, tracks, and routes saved in it.

Source: https://hikingguy.com/how-to-hike/what-is-a-gpx-file/


### Answer 2
A GPX file is made up of three main parts:

- **Waypoint (`wpt`)**:  
  A single GPS point with coordinates. It marks a specific location and can include extra info like a name or description.

- **Route (`rte`)**:  
  A planned path made of multiple waypoints (`rtept`). It shows where someone *wants* to go in the future.

- **Track (`trk`)**:  
  A recorded path of where someone *has already been*. It contains segments (`trkseg`), and each segment has points (`trkpt`) that trace the journey.

Source: https://en.wikipedia.org/wiki/GPS_Exchange_Format
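
As a rough illustration of this layout, the sketch below parses a small hand-written GPX snippet with Python's built-in `xml.etree.ElementTree` and counts the three kinds of elements. The sample document, its coordinates and the names `SAMPLE_GPX` and `summarize_gpx` are made up for this example and are not taken from the lab file; the namespace is the one used by GPX 1.1.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption for illustration, not the lab file)
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <wpt lat="47.8600" lon="8.0359"><name>Start</name></wpt>
  <rte>
    <rtept lat="47.8600" lon="8.0359"/>
    <rtept lat="47.8700" lon="8.0400"/>
  </rte>
  <trk>
    <name>demo track</name>
    <trkseg>
      <trkpt lat="47.8600" lon="8.0359"><ele>1286.5</ele></trkpt>
      <trkpt lat="47.8599" lon="8.0358"><ele>1286.5</ele></trkpt>
      <trkpt lat="47.8598" lon="8.0357"><ele>1287.0</ele></trkpt>
    </trkseg>
  </trk>
</gpx>"""

# GPX elements live in an XML namespace, so every tag lookup needs the prefix
NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def summarize_gpx(xml_text):
    """Count waypoints, route points, track segments and track points."""
    root = ET.fromstring(xml_text)
    return {
        "waypoints": len(root.findall("gpx:wpt", NS)),
        "route_points": len(root.findall("gpx:rte/gpx:rtept", NS)),
        "track_segments": len(root.findall("gpx:trk/gpx:trkseg", NS)),
        "track_points": len(root.findall("gpx:trk/gpx:trkseg/gpx:trkpt", NS)),
    }

summary = summarize_gpx(SAMPLE_GPX)
print(summary)
```

The nesting `trk` → `trkseg` → `trkpt` is visible directly in the search paths, which is the hierarchy a dataframe has to represent later on.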

### Answer 3
One option is PyGPX, a Python package that supports reading, writing and converting GPX files.
Source: https://pypi.org/project/gpx/

### Answer 4
I would choose gpxpy, since it seems to be the most popular GPX library in Python. There are therefore probably more resources to consult if I have a question about its use.

Source of the statement: https://pypi.org/project/fastgpx

Moreover, it seems simple to use: https://pypi.org/project/gpxpy

## 2. Install a package of your choice for processing gpx files
Add all the other packages that you might need for your further analysis in this section

In [12]:
import pandas as pd
import numpy as np
import math as m
import gpxpy
import plotly.express as px


## 3. Read the data from the gpx file into a pandas dataframe
* Give proper names to the columns
* If the gpx file is hierarchical, provide a proper mapping in the dataframe (e.g. separate columns or hierarchical index)
* Use pythonic approaches wherever possible, i.e. use vectorized/broadcast functionality from numpy or pandas instead of writing loops
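
One possible way to keep the hierarchy visible is to carry the track name and the segment position along with every point, so that they can later serve as separate columns or as a hierarchical index. The sketch below is only an illustration in plain Python: the nested `tracks` dictionary and the helper `flatten_tracks` are hypothetical stand-ins for the objects a GPX parser returns, and the coordinates are made up.

```python
def flatten_tracks(tracks):
    """Turn nested track -> segment -> point data into one flat record per point."""
    return [
        {
            "track": name,
            "segment": s_idx,   # position of the segment within its track
            "point": p_idx,     # position of the point within its segment
            "latitude": lat,
            "longitude": lon,
        }
        for name, segments in tracks.items()
        for s_idx, segment in enumerate(segments)
        for p_idx, (lat, lon) in enumerate(segment)
    ]

# Made-up data: one track with two segments, one track with a single segment
tracks = {
    "tour_a": [[(47.8600, 8.0359), (47.8599, 8.0358)], [(47.8598, 8.0357)]],
    "tour_b": [[(50.7165, 6.0020)]],
}

records = flatten_tracks(tracks)
for row in records:
    print(row)
```

A list of records like this can be passed to `pd.DataFrame`, and `set_index(['track', 'segment', 'point'])` would then turn the three position columns into a hierarchical index.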

The code below reads and parses the GPX file and collects all relevant data into a dataframe.

In [13]:
with open('unkown_file.gpx', 'r') as f:
    gpx = gpxpy.parse(f)

data = [
    {
        'track': track.name,
        'latitude': point.latitude,
        'longitude': point.longitude,
        'elevation': point.elevation,
        'time': point.time
    }
    for track in gpx.tracks
    for s_idx, segment in enumerate(track.segments)
    for point in segment.points
]

df = pd.DataFrame(data)

df['time'] = pd.to_datetime(df['time'], errors='coerce', utc=True)


## 4 Visualize the data
* Provide an analysis of potential visualizations
* Which one is suited for the dataset?
* Choose a proper visualization and visualize the data

### Scatter Map

Since we have latitude and longitude data, it is helpful to see each track's whole path on a map and to encode the elevation as color. Below are the corresponding visualizations, one for Tour 1 and one for Tour 2.

In [14]:
def show_scatter_map(df):
    fig = px.scatter_map(
        df,
        lat="latitude",
        lon="longitude",
        zoom=13,
        color='elevation',
        color_continuous_scale = ['green', 'yellow', 'red'],
        height=600,
        title=f"GPS Tracks on Map for Track {df['track'].unique()}"
    )

    fig.update_layout(
        map_style="basic",  # scatter_map figures are styled via layout.map, not layout.mapbox
        margin={"r":0,"t":30,"l":0,"b":0}
    )

    fig.show()


In [15]:
show_scatter_map(df[df['track']=="../tour_1.gpx"])

In [16]:
show_scatter_map(df[df['track']=="../tour_2.gpx"])

### Scatter Plot
Additionally, because we have elevation data, it is useful to display it along the timeline, similar to the elevation profile on a hiking map or in Google Maps, to show where the path ascends and descends. Below are the corresponding visualizations, one for Tour 1 and one for Tour 2.

In [17]:
def show_elevation_overtime(df):
    # Sort chronologically so the profile reads from left to right
    df = df.sort_values('time')
    fig = px.scatter(
        df,
        x='time',
        y='elevation',
    )

    fig.update_layout(
        title=f"Elevation Profile Over Time for Track {df['track'].unique()}",
        xaxis_title='Time',
        yaxis_title='Elevation (m)',
        yaxis_range=[0, 1600],
        height=500
    )

    fig.show()

show_elevation_overtime(df[df['track']=="../tour_1.gpx"])
show_elevation_overtime(df[df['track']=="../tour_2.gpx"])

## 5 Print statistics on the data
Provide at least:
* Fundamental statistics per column
* Rowwise gradients (differences between elements)
* If there is a time column, provide derived measures like speed $\frac{ds}{dt} = \dot{s}$

### Fundamental Column Statistics

In [18]:
print("=== Basic Statistics Per Track ===")
print(df.groupby("track")[['latitude', 'longitude', 'elevation']].describe())

=== Basic Statistics Per Track ===
              latitude                                                        \
                 count       mean       std        min        25%        50%   
track                                                                          
../tour_1.gpx   1192.0  47.873080  0.007296  47.859334  47.867169  47.872850   
../tour_2.gpx   1007.0  50.715845  0.006810  50.701640  50.712238  50.715776   

                                    longitude            ...            \
                     75%        max     count      mean  ...       75%   
track                                                    ...             
../tour_1.gpx  47.880249  47.883705    1192.0  8.028455  ...  8.038022   
../tour_2.gpx  50.719717  50.729252    1007.0  5.978734  ...  5.991250   

                        elevation                                        \
                    max     count         mean         std          min   
track                                      

In [19]:
df

Unnamed: 0,track,latitude,longitude,elevation,time
0,../tour_1.gpx,47.860054,8.035913,1286.473413,2024-08-03 08:41:23.017000+00:00
1,../tour_1.gpx,47.859973,8.035832,1286.473413,2024-08-03 08:41:30+00:00
2,../tour_1.gpx,47.859987,8.035690,1286.473413,2024-08-03 08:41:39+00:00
3,../tour_1.gpx,47.859982,8.035551,1286.473413,2024-08-03 08:41:47+00:00
4,../tour_1.gpx,47.859910,8.035452,1286.473413,2024-08-03 08:41:58+00:00
...,...,...,...,...,...
2194,../tour_2.gpx,50.716523,6.002066,201.393590,2025-03-09 11:19:25.999000+00:00
2195,../tour_2.gpx,50.716592,6.002225,201.393590,2025-03-09 11:19:28.999000+00:00
2196,../tour_2.gpx,50.716653,6.002400,201.393590,2025-03-09 11:19:31.998000+00:00
2197,../tour_2.gpx,50.716706,6.002568,201.393590,2025-03-09 11:19:34.998000+00:00


In [20]:
from haversine import haversine, Unit
import numpy as np

def compute_gradients(track_df):
    track_df = track_df.sort_values("time").reset_index(drop=True)

    prev_points = list(zip(track_df['latitude'].shift(), track_df['longitude'].shift()))
    curr_points = list(zip(track_df['latitude'], track_df['longitude']))

    distances = [
        # shift() yields NaN (not None) for the first row, so test for NaN explicitly
        haversine(p1, p2, unit=Unit.METERS) if not np.isnan(p1[0]) else np.nan
        for p1, p2 in zip(prev_points, curr_points)
    ]

    track_df['distance'] = distances

    track_df['elev_diff'] = track_df['elevation'] - track_df['elevation'].shift()
    track_df['gradient'] = track_df['elev_diff'] / track_df['distance']

    return track_df

df = df.groupby('track', group_keys=True).apply(compute_gradients, include_groups=False).reset_index(drop=False)
df


Unnamed: 0,track,level_1,latitude,longitude,elevation,time,distance,elev_diff,gradient
0,../tour_1.gpx,0,47.860054,8.035913,1286.473413,2024-08-03 08:41:23.017000+00:00,,,
1,../tour_1.gpx,1,47.859973,8.035832,1286.473413,2024-08-03 08:41:30+00:00,10.846247,0.0,0.0
2,../tour_1.gpx,2,47.859987,8.035690,1286.473413,2024-08-03 08:41:39+00:00,10.707782,0.0,0.0
3,../tour_1.gpx,3,47.859982,8.035551,1286.473413,2024-08-03 08:41:47+00:00,10.385091,0.0,0.0
4,../tour_1.gpx,4,47.859910,8.035452,1286.473413,2024-08-03 08:41:58+00:00,10.892630,0.0,0.0
...,...,...,...,...,...,...,...,...,...
2194,../tour_2.gpx,1002,50.716523,6.002066,201.393590,2025-03-09 11:19:25.999000+00:00,14.332752,0.0,0.0
2195,../tour_2.gpx,1003,50.716592,6.002225,201.393590,2025-03-09 11:19:28.999000+00:00,13.571199,0.0,0.0
2196,../tour_2.gpx,1004,50.716653,6.002400,201.393590,2025-03-09 11:19:31.998000+00:00,14.064381,0.0,0.0
2197,../tour_2.gpx,1005,50.716706,6.002568,201.393590,2025-03-09 11:19:34.998000+00:00,13.214732,0.0,0.0



In [22]:
df['time'] = pd.to_datetime(df['time'])

def compute_speed(track_df):
    track_df = track_df.sort_values("time").reset_index(drop=True)
    # Seconds elapsed since the previous point of the same track
    track_df['time_diff'] = track_df['time'].diff().dt.total_seconds()
    track_df['speed_mps'] = track_df['distance'] / track_df['time_diff']
    return track_df

# compute_gradients was already applied above, so it is not repeated here.
# The earlier reset_index() left a 'level_1' column behind; resetting the index
# again would raise "cannot insert level_1, already exists", so drop it first.
df = (
    df.drop(columns='level_1', errors='ignore')
      .groupby('track', group_keys=True)
      .apply(compute_speed, include_groups=False)
      .reset_index(level=1, drop=True)  # discard the per-track row counter
      .reset_index()                    # turn 'track' back into a column
)
df

In [None]:
fig_speed = px.line(
    df[df['track'] == "../tour_1.gpx"],
    x='time', 
    y='speed_mps', 
    color='track',
    title='Speed over Time',
    labels={'speed_mps': 'Speed (m/s)', 'time': 'Time'}
)

fig_speed.show()

The code below shows the basic statistics of each numeric column, separately for each track.

In [None]:
def show_basic_statistics_by_track(df):
    numeric_cols = ['latitude', 'longitude', 'elevation']

    for track_name in df['track'].unique():
        track_df = df[df['track'] == track_name]
        print(f"\n=== Basic Statistics for Track: {track_name} ===")
        print(track_df[numeric_cols].describe())

show_basic_statistics_by_track(df)


The result above shows a clear contrast between the two tours. Tour 1 (`../tour_1.gpx`) takes place in a noticeably higher region, with elevations ranging from around 1089 meters to nearly 1494 meters. The standard deviation of over 137 meters suggests that this route includes significant climbs and descents. Tour 2 is much flatter: its elevation mostly hovers around 190 meters and varies far less.

### Row Wise Gradient

This part calculates the distance between consecutive GPS points using the Haversine formula and then computes the elevation difference and gradient (slope) between those points. The gradient is defined as the elevation change divided by the horizontal distance, helping to identify how steep a segment is. Finally, it prints summary statistics for the gradient, giving insight into how hilly or flat the route is overall.
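
To make the two quantities concrete, here is a minimal sketch of the Haversine distance and the gradient in plain Python, independent of the `haversine` package. The function names `haversine_m` and `gradient` and the Earth radius constant are assumptions chosen for this illustration; the coordinates are the first two points of tour 1 as printed in the dataframe above.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def gradient(elev_diff_m, distance_m):
    """Elevation change per meter of horizontal distance; NaN if the point did not move."""
    return elev_diff_m / distance_m if distance_m > 0 else math.nan

# First two points of tour 1 (rounded to six decimals in the printed table)
d = haversine_m(47.860054, 8.035913, 47.859973, 8.035832)
print(f"distance: {d:.2f} m, gradient: {gradient(0.0, d)}")

# Hypothetical segment: 5 m of climb over 100 m of horizontal distance
print(f"gradient: {gradient(5.0, 100.0):.2f}")
```

The distance comes out at roughly 10.8 m, in line with the `distance` value shown for row 1. A gradient of 0.05 corresponds to a 5 % slope.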

In [None]:
# Install if not already installed:
# pip install haversine

from haversine import haversine, Unit

def row_wise_gradient(df):
    # Compute rowwise distance and elevation difference
    df['prev_coords'] = list(zip(df['latitude'].shift(), df['longitude'].shift()))
    df['curr_coords'] = list(zip(df['latitude'], df['longitude']))

    # Apply haversine distance in meters
    df['distance'] = df.apply(lambda row: haversine(row['prev_coords'], row['curr_coords'], unit=Unit.METERS), axis=1)

    # Elevation difference and gradient
    df['elev_diff'] = df['elevation'] - df['elevation'].shift()
    df['gradient'] = df['elev_diff'] / df['distance']

    # Output
    print("\n=== Gradient Stats ===")
    print(df['gradient'].describe())

row_wise_gradient(df[df['track']=="../tour_1.gpx"].copy())
row_wise_gradient(df[df['track']=="../tour_2.gpx"].copy())


### Derived Measures: Speed & Its Comparison
This function calculates and compares the speed statistics for each GPS track by computing the distance and time differences between consecutive points. It then derives the speed (in meters per second) and prints a summary of descriptive statistics (like mean, min, max, etc.) for each track.
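
As a small worked example of the speed calculation for a single pair of points, the sketch below divides a distance by the elapsed time between two timestamps. The helper name `speed_mps` is an assumption for this illustration; the timestamps and the distance are those of the first two rows of tour 1 as printed above.

```python
from datetime import datetime

def speed_mps(distance_m, t_prev, t_curr):
    """Average speed in m/s between two timestamped points; None if no time passed."""
    dt = (t_curr - t_prev).total_seconds()
    return distance_m / dt if dt > 0 else None

# Timestamps of rows 0 and 1 of tour 1, and the distance computed between them
t0 = datetime.fromisoformat("2024-08-03 08:41:23.017000+00:00")
t1 = datetime.fromisoformat("2024-08-03 08:41:30+00:00")
v = speed_mps(10.846247, t0, t1)

print(f"{v:.2f} m/s = {v * 3.6:.1f} km/h")
```

About 10.8 m covered in just under 7 s gives roughly 1.55 m/s, which is a plausible walking pace. Returning `None` when no time has passed avoids a division by zero for duplicate timestamps.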

In [None]:
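from haversine import haversine_vector, Unit

def compare_speed_statistics(df):
    # assign() and sort_values() return new frames, so the caller's dataframe is not modified
    df = df.assign(time=pd.to_datetime(df['time'])).sort_values(by=['track', 'time'])

    df['time_diff'] = df.groupby('track')['time'].diff().dt.total_seconds()

    # Previous point within the same track (NaN for the first point of each track)
    prev = df.groupby('track')[['latitude', 'longitude']].shift()
    valid = prev['latitude'].notna()

    # haversine() takes two (lat, lon) tuples, so use haversine_vector() for whole columns
    df['distance'] = np.nan
    df.loc[valid, 'distance'] = haversine_vector(
        prev.loc[valid].to_numpy(),
        df.loc[valid, ['latitude', 'longitude']].to_numpy(),
        unit=Unit.METERS,  # meters, so that the speed is in m/s
    )
    df['speed'] = df['distance'] / df['time_diff']

    df_clean = df.dropna(subset=['speed'])

    stats = df_clean.groupby('track')['speed'].describe().T
    print("\n=== Speed Statistics (m/s) per Track ===")
    print(stats)

compare_speed_statistics(df)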
