OpenStreetMap PBF Data Processing with Apache Spark
================================================

This notebook demonstrates how to process OpenStreetMap (OSM) PBF files using a custom Spark datasource.
It loads a regional PBF extract and filters it by tag at read time, so that only the relevant entries are loaded into the Spark DataFrame.

Step 1: Initialize Custom PBF Datasource
--------------------------------------

In [None]:
%run "./load_datasource"

Step 2: Configure Data Paths
-------------------------
Regional OSM PBF extracts can be downloaded from Geofabrik: https://download.geofabrik.de/
The cell below defines paths for three extracts. `path` (Andorra) is the one loaded in Step 4.


In [None]:
path_lux = "/Volumes/timo/geospatial/osm/luxembourg-latest.osm.pbf"
path_fra = "/Volumes/timo/geospatial/osm/france-latest.osm.pbf"
path = "/Volumes/timo/geospatial/osm/andorra-latest.osm.pbf"
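
The paths above follow a `<region>-latest.osm.pbf` naming pattern. As a minimal sketch, the helpers below build a download URL and a matching local path for a region. The helper names are hypothetical, and the `<continent>/<region>` URL layout is an assumption about how Geofabrik organizes its extracts, so check the link on the download page before relying on it.

```python
# Hypothetical helpers; not part of the datasource or the original notebook.
GEOFABRIK_BASE = "https://download.geofabrik.de"


def geofabrik_url(continent, region):
    """Build a download URL, assuming a <continent>/<region>-latest.osm.pbf layout."""
    return f"{GEOFABRIK_BASE}/{continent}/{region}-latest.osm.pbf"


def local_pbf_path(volume_dir, region):
    """Build the local path where the extract for `region` is expected to live."""
    return f"{volume_dir}/{region}-latest.osm.pbf"


url = geofabrik_url("europe", "andorra")
local = local_pbf_path("/Volumes/timo/geospatial/osm", "andorra")
print(url)
print(local)
```

With the volume directory used in this notebook, `local_pbf_path` reproduces the value assigned to `path` above.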

Step 3: Initialize Spark Session
----------------------------


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Step 4: Load and Filter PBF Data
-----------------------------
IMPORTANT: The initial PBF read runs on a single node, where the file is parsed with osmium. It is not parallelized across the cluster.

Performance Tips:
- Run the read on a machine with enough memory and CPU for the extract being loaded
- Apply filters in the following order for optimal performance:
  1. emptyTagFilter: Removes entries with no tags
  2. keyFilter: Filters for specific OSM keys
  3. tagFilter: Filters for specific key-value pairs
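
To illustrate what the three filters select, here is a minimal plain-Python sketch applied to a handful of made-up entries. It assumes the filters act as successive narrowing steps; the function names and the sample data are hypothetical and do not reflect how the datasource implements filtering internally.

```python
# Illustrative stand-ins for the three filters; not the datasource's code.

def empty_tag_filter(elements):
    """Keep only entries that carry at least one tag."""
    return [e for e in elements if e["tags"]]


def key_filter(elements, key):
    """Keep only entries that have the given OSM key."""
    return [e for e in elements if key in e["tags"]]


def tag_filter(elements, key, value):
    """Keep only entries whose key has exactly the given value."""
    return [e for e in elements if e["tags"].get(key) == value]


# Made-up sample entries for demonstration.
elements = [
    {"id": 1, "tags": {}},
    {"id": 2, "tags": {"highway": "residential"}},
    {"id": 3, "tags": {"building": "yes"}},
    {"id": 4, "tags": {"building": "hospital"}},
]

tagged = empty_tag_filter(elements)                        # ids 2, 3, 4
buildings = key_filter(tagged, "building")                 # ids 3, 4
hospitals = tag_filter(buildings, "building", "hospital")  # id 4
print([e["id"] for e in hospitals])
```

Each step passes fewer entries to the next one, which matches the order recommended in the list above.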


In [None]:
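df = (
    spark.read.format("pbf")
    .option("path", path)                             # Andorra extract from Step 2
    .option("geometryType", "WKT")                    # requested geometry format
    .option("emptyTagFilter", True)                   # 1. remove entries with no tags
    .option("keyFilter", "building")                  # 2. keep entries with a "building" key
    .option("tagFilter", "('building', 'hospital')")  # 3. keep the building=hospital pair
    .load()
)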

Step 5: Display Results
--------------------

In [None]:
df.display()

Note: Only the initial read is single-node. After this loading phase, the data is distributed
across the Spark cluster for further processing and analysis.