
Best Practices guide for creation of good GeoParquet files (focused on distribution) #254


Draft · wants to merge 7 commits into base: main
Conversation

@cholmes (Member) commented Jan 9, 2025

Attempt to pull together recommendations / best practices as discussed in #251.

More work is needed, and feedback/help is very welcome. There is likely more to discuss to get the recommendations right, but I wanted to put up something for people to react to.

Comment on lines +220 to +221
### Sedona

Collaborator

Feel free to take out the comments (those were more for me writing this or for a future blog post).

@jiayuasu Is this about right?

Suggested change:

### Sedona
```python
import glob

from sedona.spark import SedonaContext, GridType
from sedona.utils.structured_adapter import StructuredAdapter
from sedona.sql.st_functions import ST_GeoHash

# Configuring this line to do the right thing can be tricky
# https://sedona.apache.org/latest/setup/install-python/?h=python#prepare-sedona-spark-jar
config = (
    SedonaContext.builder()
    .config("spark.executor.memory", "6G")
    .config("spark.driver.memory", "6G")
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# Read from GeoParquet or some other datasource + do any spatial ops/transformations
# using Sedona pyspark or SQL
df = sedona.read.format("geoparquet").load(
    "/Users/dewey/gh/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet"
)

# Create the partitioning. KDBTREE provides a nice balance, providing
# tight (but well-separated) partitions with approximately equal numbers of
# features in each file. Note that num_partitions is only a suggestion
# (the actual value may differ).
rdd = StructuredAdapter.toSpatialRdd(df, "geometry")
rdd.analyze()

# We call the WithoutDuplicates() variant to ensure that we don't introduce
# duplicate features (i.e., each feature is assigned to a single partition instead
# of being assigned to every partition it intersects). For points the behaviour of
# spatialPartitioning() and spatialPartitioningWithoutDuplicates() is identical.
rdd.spatialPartitioningWithoutDuplicates(GridType.KDBTREE, num_partitions=8)

# Get the grids for this partitioning (you can reuse this partitioning by passing
# it to some other spatialPartitioningWithoutDuplicates() to ensure a different
# write has identical partition extents)
rdd.getPartitioner().getGrids()

df_partitioned = StructuredAdapter.toSpatialPartitionedDf(rdd, sedona)

# Optional: sort within partitions for tighter row group bounding boxes within files
df_partitioned = (
    df_partitioned.withColumn("geohash", ST_GeoHash(df_partitioned.geometry, 12))
    .sortWithinPartitions("geohash")
    .drop("geohash")
)

# Write in parallel directly from each executor node. This scales nicely to
# (much) bigger-than-memory data, particularly if done with a configured cluster
# (e.g., Databricks, Glue, Wherobots).
# There are several options for GeoParquet writing:
# https://sedona.apache.org/latest/tutorial/files/geoparquet-sedona-spark/
df_partitioned.write.format("geoparquet").mode("overwrite").save(
    "buildings_partitioned"
)

# The output files have funny names because Spark writes them this way
files = glob.glob("buildings_partitioned/*.parquet")
len(files)
```

Collaborator

Yes. LGTM!
