![](https://wherobots.com/wp-content/uploads/2023/12/Inline-Blue_Black_onWhite@3x.png)

# WherobotsAI Map Matching Example

In this notebook we introduce Wherobots Map Matching, a library for creating map applications with large-scale geospatial data, and show how to match noisy GPS trajectories to the underlying road segments using OpenStreetMap road network data. [Read more about Wherobots Map Matching in the Wherobots documentation.](https://docs.wherobots.com/latest/tutorials/sedonamaps/introduction/)

In [None]:
import json
from shapely.geometry import LineString
from pyspark.sql.window import Window
from pyspark.sql.functions import col, expr, udf, collect_list, struct, row_number, lit
from sedona.spark import *

## Define Sedona context

In [None]:
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

## Map Matching
Map matching is a crucial step in many transportation analyses. It involves aligning a sequence of observed user positions (usually from GPS) onto a digital map, identifying the most likely path or sequence of roads that a user has traversed. 

In this section, we will use Wherobots Map Matching for our map matching tasks.
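
To make the idea concrete, here is a minimal sketch of the simplest geometric form of map matching: snapping each noisy point to the closest road segment. It is a toy in planar coordinates with made-up road names and points, and it is not the algorithm Wherobots uses. Practical matchers also take the order of the points and the connectivity of the road network into account.

```python
def project_onto_segment(p, a, b):
    """Return the point on segment a-b closest to p (planar coordinates)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return a  # degenerate segment: both endpoints coincide
    # Fraction along the segment of the perpendicular projection, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_to_nearest(p, segments):
    """Return (segment name, snapped point, distance) for the closest segment."""
    best = None
    for name, (a, b) in segments.items():
        q = project_onto_segment(p, a, b)
        d = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
        if best is None or d < best[2]:
            best = (name, q, d)
    return best


# Hypothetical road segments and a noisy track, for illustration only
roads = {
    "main_st": ((0.0, 0.0), (10.0, 0.0)),
    "oak_ave": ((5.0, 0.0), (5.0, 10.0)),
}
noisy_track = [(1.0, 0.4), (4.0, -0.3), (5.2, 3.0)]

matched = [snap_to_nearest(p, roads) for p in noisy_track]
for name, point, dist in matched:
    print(name, point, round(dist, 2))
```

Snapping each point independently can jump between parallel or crossing roads, which is why trajectory-aware matching is generally preferred for real GPS data.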

### Load Ann Arbor, Michigan Road Network Data from OSM File into Spatial Dataframe
We use OpenStreetMap (OSM) data for the Ann Arbor, Michigan region as the road network for our analysis. OSM offers detailed, openly licensed road network data, making it a good fit for transportation studies.

The `matcher.load_osm` step needs to run only once to load this road network data. Given the granularity and detail of OSM datasets, it might take some time.
<br><br>

In [None]:
from wherobots import matcher
dfEdge = matcher.load_osm("s3://wherobots-examples/data/osm_AnnArbor_large.xml", "[car]")
dfEdge.show(5)

### Load GPS Tracks Data from VED Dataset
For this analysis, we're leveraging the Vehicle Energy Dataset (VED). VED is a comprehensive dataset capturing GPS trajectories of 383 vehicles (including gasoline vehicles, HEVs, and PHEV/EVs) in Ann Arbor, Michigan, USA, from Nov 2017 to Nov 2018. The data spans ~374,000 miles and includes details on fuel, energy, speed, and auxiliary power usage. Driving scenarios cover diverse conditions, from highways to traffic-dense downtown areas, across different seasons.

Source: "Vehicle Energy Dataset (VED), A Large-scale Dataset for Vehicle Energy Consumption Research" by Geunseob (GS) Oh, David J. LeBlanc, Huei Peng. Published in IEEE Transactions on Intelligent Transportation Systems (T-ITS), 2020.

GitHub: https://github.com/gsoh/VED
<br><br>

In [None]:
df = sedona.read.csv("s3://wherobots-examples/data/VED_171101_week.csv", header=True, inferSchema=True)

<br>For this analysis, we select only the columns holding the vehicle id, trip id, timestamp, latitude, and longitude. Each row in the dataset represents one spatiotemporal point of a vehicle's journey, with these columns:

**VehId**: Vehicle Identifier.<br>
**Trip**: Trip Identifier for a vehicle. It helps distinguish between different journeys of the same vehicle.<br>
**Timestamp(ms)**: Timestamp of the data point, in milliseconds.<br>
**Latitude[deg]**: Latitude coordinate of the vehicle at the given timestamp.<br>
**Longitude[deg]**: Longitude coordinate of the vehicle at the given timestamp.
<br><br>

In [None]:
df = df.select(['VehId', 'Trip', 'Timestamp(ms)','Latitude[deg]', 'Longitude[deg]'])

In [None]:
df.show(10)

<br>The combination of VehId and Trip forms a unique key for our dataset: every unique pair identifies one trajectory of one vehicle. Raw GPS points, while valuable, can be scattered, redundant, and lacking in context when viewed independently. By organizing these points into coherent trajectories represented as LineStrings, we make the data easier to interpret, analyze, and apply.

### Create LineString Geometries from GPS tracks

A groupBy operation on the 'VehId' and 'Trip' columns isolates individual trajectories. Within each group, the rows are first sorted by timestamp so that the LineString follows the chronological order of the GPS points. The resulting LineString captures the corresponding vehicle's trajectory over time.

A User Defined Function (UDF) is created for Spark that utilizes the function below to process Spatial DataFrame rows into LineString geometries.
<br><br>

In [None]:
def rows_to_linestring(rows):
    # Sort the collected points chronologically so the line follows the trip
    sorted_rows = sorted(rows, key=lambda x: x['Timestamp(ms)'])
    # Shapely expects (x, y) order, i.e. (longitude, latitude)
    coords = [(row['Longitude[deg]'], row['Latitude[deg]']) for row in sorted_rows]
    return LineString(coords)

linestring_udf = udf(rows_to_linestring, GeometryType())
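
As a quick check of the ordering logic outside Spark, the sketch below applies the same sort to plain dictionaries standing in for collected rows. The helper name `ordered_coords` and the sample values are made up for illustration. The WKT string is assembled by hand only to show the resulting (longitude, latitude) order.

```python
def ordered_coords(rows):
    """Sort rows by timestamp and return (longitude, latitude) tuples."""
    sorted_rows = sorted(rows, key=lambda r: r["Timestamp(ms)"])
    return [(r["Longitude[deg]"], r["Latitude[deg]"]) for r in sorted_rows]


# Hypothetical points from one trip, deliberately out of order
sample_rows = [
    {"Timestamp(ms)": 2000, "Latitude[deg]": 42.279, "Longitude[deg]": -83.741},
    {"Timestamp(ms)": 0, "Latitude[deg]": 42.277, "Longitude[deg]": -83.743},
    {"Timestamp(ms)": 1000, "Latitude[deg]": 42.278, "Longitude[deg]": -83.742},
]

coords = ordered_coords(sample_rows)
wkt = "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in coords) + ")"
print(wkt)
```

Without the sort, the line would connect points in whatever order `collect_list` returned them, which Spark does not guarantee to be chronological.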

In [None]:
# Group by VehId and Trip and aggregate
dfPath = (df
          .groupBy("VehId", "Trip")
          .agg(collect_list(struct("Timestamp(ms)", "Latitude[deg]", "Longitude[deg]")).alias("coords"))
          .withColumn("geometry", linestring_udf("coords"))
         )

### Create a Spatial DataFrame of GPS Tracks

In [None]:
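# Use row_number to generate sequential IDs. Partitioning by a constant puts every
# row in a single window; ordering by VehId and Trip makes the numbering deterministic.
window_spec = Window.partitionBy(lit(5)).orderBy("VehId", "Trip")
dfPath = dfPath.withColumn("ids", row_number().over(window_spec) - 1)
# Keep only the first 10 trajectories (ids 0 through 9) to keep the example small
dfPath = dfPath.filter(dfPath['ids'] < 10)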
dfPath = dfPath.select("ids", "VehId", "Trip", "coords", "geometry")
dfPath.show()
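
The numbering step can be pictured in plain Python as sorting the (VehId, Trip) keys and enumerating them from zero. The sketch below uses made-up key values and keeps two trajectories instead of ten. It only mirrors the logic and does not involve Spark.

```python
# Hypothetical (VehId, Trip) pairs standing in for the grouped trajectories
trips = [(10, 706), (8, 1558), (10, 1558), (8, 706)]

# Sort by (VehId, Trip) and number from 0, mirroring row_number() - 1
ids = {key: i for i, key in enumerate(sorted(trips))}

# Keep only the first two trajectories, mirroring the ids < 10 filter
first_two = [key for key, i in ids.items() if i < 2]

print(ids)
print(first_two)
```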

## Perform Map Matching

In [None]:
sedona.conf.set("wherobots.tools.mm.maxdist", "100")
sedona.conf.set("wherobots.tools.mm.maxdistinit", "100")
sedona.conf.set("wherobots.tools.mm.obsnoise", "40")

dfMmResult = matcher.match(dfEdge, dfPath, "geometry", "geometry")
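
The notebook does not define the three settings above. Going only by their names, a plausible reading is that `maxdist` is a cutoff distance for candidate road segments and `obsnoise` is the expected GPS noise, both in meters. Under that assumption, the sketch below shows how such parameters commonly enter a probabilistic matcher: candidates beyond the cutoff are dropped, and closer candidates receive a higher Gaussian score. The function `emission_score` is hypothetical and is not part of the Wherobots API.

```python
import math


def emission_score(distance_m, obs_noise_m, max_dist_m):
    """Unnormalized Gaussian score for a candidate road at a given distance.

    Assumption: obs_noise_m plays the role of a standard deviation and
    max_dist_m is a hard cutoff. Neither is confirmed by the notebook.
    """
    if distance_m > max_dist_m:
        return 0.0  # candidate is too far away to be considered
    return math.exp(-0.5 * (distance_m / obs_noise_m) ** 2)


# Scores for candidates at increasing distance, using the values set above
for d in (0, 20, 40, 80, 150):
    print(d, round(emission_score(d, obs_noise_m=40, max_dist_m=100), 3))
```

Under this reading, a larger noise value makes the matcher more tolerant of points far from any road, while a smaller cutoff reduces the number of candidates to evaluate.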

<br>The resulting DataFrame contains the map matching output for each GPS trajectory:

**ids**: A unique identifier for each trajectory, representing a distinct vehicle journey.<br>
**observed_points**: Represents the original GPS trajectories. These are the linestrings formed from the raw GPS points collected during each vehicle journey.<br>
**matched_points**: The processed trajectories post map-matching. These linestrings are aligned onto the actual road network, correcting for any GPS inaccuracies.<br>
**matched_nodes**: A list of node identifiers from the road network that the matched trajectory passes through. These nodes correspond to intersections, turns, or other significant points in the road network.
<br><br>

In [None]:
dfMmResult.show()
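
One way to gauge how far matching moved a trajectory is to measure the distance between an observed position and its matched counterpart. The sketch below uses the haversine formula on a made-up pair of coordinates. The helper `haversine_m` is illustrative and does not read from `dfMmResult`.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, a common approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical observed and matched positions as (longitude, latitude)
observed = (-83.7430, 42.2808)
matched = (-83.7431, 42.2807)

offset_m = haversine_m(*observed, *matched)
print(f"{offset_m:.1f} m")
```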

## Visualize the result using SedonaKepler

In [None]:
with open('conf/map_config.json', 'r') as file:
    map_config = json.load(file)

In [None]:
mapAll = SedonaKepler.create_map()

SedonaKepler.add_df(mapAll, dfEdge, name="Road Network")
SedonaKepler.add_df(mapAll, dfMmResult.selectExpr("observed_points AS geometry"), name="Observed Points")
SedonaKepler.add_df(mapAll, dfMmResult.selectExpr("matched_points AS geometry"), name="Matched Points")
mapAll.config = map_config

mapAll

<br>This visualization shows only the trajectory whose 'ids' value is 2. To view a different trajectory, change the filter condition to the desired 'ids' value.
<br><br>

In [None]:
mapFil = SedonaKepler.create_map()

SedonaKepler.add_df(mapFil, dfEdge, name="Road Network")
SedonaKepler.add_df(mapFil, dfMmResult.filter(dfMmResult['ids']==2).selectExpr("observed_points AS geometry"), name="Observed Points")
SedonaKepler.add_df(mapFil, dfMmResult.filter(dfMmResult['ids']==2).selectExpr("matched_points AS geometry"), name="Matched Points")
mapFil.config = map_config

mapFil